For high dimensional statistical models, researchers have begun
to focus on situations which can be described as having relatively few 
moderately large coefficients. Such situations lead to some very subtle 
statistical problems. In particular, Ingster and Donoho and Jin have 
considered a sparse normal means testing problem, in which they de- 
scribed the precise demarcation or detection boundary. Meinshausen 
and Rice have shown that it is even possible to estimate consistently 
the fraction of nonzero coordinates on a subset of the detectable re- 
gion, but leave unanswered the question of exactly in which parts of 
the detectable region consistent estimation is possible. 

In the present paper we develop a new approach for estimating 
the fraction of nonzero means for problems where the nonzero means 
are moderately large. We show that the detection region described 
by Ingster and Donoho and Jin turns out to be the region where it 
is possible to consistently estimate the expected fraction of nonzero 
coordinates. This theory is developed further and minimax rates of 
convergence are derived. A procedure is constructed which attains 
the optimal rate of convergence in this setting. Furthermore, the pro- 
cedure also provides an honest lower bound for confidence intervals 
while minimizing the expected length of such an interval. Simula- 
tions are used to enable comparison with the work of Meinshausen 
and Rice, where a procedure is given but where rates of convergence 
have not been discussed. Extensions to more general Gaussian mix- 
ture models are also given. 

1. Introduction. In many statistical applications such as analysis of microarray
data, signal recovery and functional magnetic resonance imaging
(fMRI), the focus is often on identifying and estimating relatively few
significant components from a high dimensional vector. In such applications,
models which allow a parsimonious representation have important advan- 
tages, since effective procedures can often be developed based on relatively 
simple testing and estimation principles. For example, in signal and image 
recovery, wavelet thresholding is an effective approach for recovering noisy 
signals since wavelet expansions of common functions often lead to a sparse 
representation; the quality of the recovery depends only on the large coeffi- 
cients; the "small" coefficients have relatively little effect on the quality of 
the reconstruction, and thresholding rules are effective in identifying and es- 
timating the large coefficients. Likewise, in problems of multiple comparison 
where only a very small fraction of hypotheses are false, the false discov- 
ery rate (FDR) approach introduced by Benjamini and Hochberg [1] is an 
effective tool for identifying those false hypotheses. 

In these problems, the focus is on discovering large components. However, 
recently there has been a shift of attention toward problems which involve 
identifying or estimating "moderately" large components. Such terms can- 
not be isolated or detected with high probability individually. However it is 
possible to detect the presence of a collection of such "moderate" terms. For 
multiple comparison problems where there are a large number of tests to be 
performed, it may not be possible to identify the particular false hypotheses, 
although it is possible to discover the fraction of the false null hypotheses. 
For example, Meinshausen and Rice [14] discuss the Taiwanese-American 
Occultation Survey, where it is difficult to tell whether an occultation has 
occurred for a particular star at a particular time, but it is possible to esti- 
mate the fraction of occultations that have occurred over a period of time. 
In this setting, it is not possible to perform individual tests with high preci- 
sion, but it is possible to estimate the fraction of false nulls. Other examples 
include the analysis of Comparative Genomic Hybridization (CGH) lung 
cancer data [11], microarray breast cancer data [6, 10] and Single Nucleotide 
Polymorphism (SNP) data on Parkinson disease [13]. 

For such applications where there are relatively few nonzero components, 
it is natural to develop the theory with a random effects model; see, for 
example, Efron [6], Meinshausen and Rice [14] and Genovese and Wasserman 
[7]. Consider n independent observations from a Gaussian mixture model, 

(1.1) $X_i = \mu_i + z_i, \quad z_i \stackrel{\text{i.i.d.}}{\sim} N(0,1), \quad 1 \le i \le n,$

where $\mu_i$ are the random effects with $P\{\mu_i = 0\} = 1 - \epsilon_n$, and given $\mu_i \neq 0$,
$\mu_i \sim H$ for some distribution $H$. Equivalently we may write

(1.2) $X_i \stackrel{\text{i.i.d.}}{\sim} (1 - \epsilon_n) N(0,1) + \epsilon_n G, \quad 1 \le i \le n,$

where $G$ is the convolution between $H$ and a standard Gaussian distribution.
In these models, the problem of estimating the fraction of nonzero
terms corresponds to estimating the parameter $\epsilon_n$, and we are particularly
interested in the case where the signal is sparse and the nonzero terms $\mu_i$
are "moderately" large (i.e., $\epsilon_n$ is small and $|\mu_i| < \sqrt{2\log n}$). This general
problem appears to be of fundamental importance. 
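
As an illustration of the random effects model (1.1), the following is a minimal Python sketch for simulating such data. The function name `sample_mixture`, the callable `draw_effect` standing in for the distribution $H$, and the numerical values chosen for $n$, $\epsilon_n$ and the point mass are illustrative assumptions and are not taken from the paper.

```python
import math
import random

def sample_mixture(n, eps, draw_effect, rng):
    """Draw X_1..X_n from model (1.1): X_i = mu_i + z_i with z_i ~ N(0,1).

    Each mu_i is 0 with probability 1 - eps; otherwise it is drawn from H
    through the caller-supplied draw_effect(rng) (a hypothetical helper).
    Returns the sample and the number of nonzero effects."""
    xs, n_nonzero = [], 0
    for _ in range(n):
        if rng.random() < eps:
            mu = draw_effect(rng)
            n_nonzero += 1
        else:
            mu = 0.0
        xs.append(mu + rng.gauss(0.0, 1.0))
    return xs, n_nonzero

# Two-point special case: H is a point mass below sqrt(2 log n).
n = 100_000
eps_n = n ** -0.6                          # sparse: about 100 nonzero means
mu_n = math.sqrt(2 * 0.5 * math.log(n))    # "moderately" large
rng = random.Random(0)
xs, n_nonzero = sample_mixture(n, eps_n, lambda r: mu_n, rng)
```

With these illustrative values only about one observation in a thousand carries a nonzero mean, and that mean lies below $\sqrt{2\log n}$, so individual nonzero terms cannot be reliably singled out.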

The development of useful estimates of $\epsilon_n$ along with the corresponding
statistical analysis appears to pose many challenges. In fact this theory is
already quite involved even in the apparently simple special case where $H$
is concentrated at a single point $\mu_n$; here $\mu_n$ depends on $n$ but not on $i$. In
this case (1.2) becomes a two-point mixture model,

(1.3) $X_i \stackrel{\text{i.i.d.}}{\sim} (1 - \epsilon_n) N(0,1) + \epsilon_n N(\mu_n, 1), \quad 1 \le i \le n.$

In such a setting, the problem of testing the null hypothesis $H_0: \epsilon_n = 0$
against the alternative $H_a: \epsilon_n > 0$ was first studied in detail in Ingster [8],
where $(\epsilon_n, \mu_n)$ are assumed to be known (see also [9]). Ingster showed that
this apparently simple testing problem contains a surprisingly rich theory
even though the optimal test is clearly the likelihood ratio test. Donoho and
Jin [5] extended this work to the case of unknown $(\epsilon_n, \mu_n)$. It was shown
that the interesting range for $(\epsilon_n, \mu_n)$ corresponds to a relatively "small" $\epsilon_n$
and a "moderately" large $\mu_n$. A detection boundary was developed which
separates the possible pairs $(\epsilon_n, \mu_n)$ into two regions, the detectable region
and the undetectable region. When $(\epsilon_n, \mu_n)$ belongs to the interior of the
undetectable region, the null and alternative hypotheses merge asymptotically
and no test can successfully separate them. When $(\epsilon_n, \mu_n)$ belongs
to the interior of the detectable region, the null and alternative hypotheses
separate asymptotically.

Although the theory of testing the above null hypothesis is closely related
to the estimation problem we are considering, it does not automatically yield
estimates of $\epsilon_n$. In fact, the problem of estimating $\epsilon_n$ appears to contain
further challenges which are not present in the above testing problem. Even the
theory for consistent estimation of $\epsilon_n$ recently studied in Meinshausen and
Rice [14] is quite complicated. Meinshausen and Rice [14] gave an estimate
of $\epsilon_n$ and showed it to be consistent on a subset of the detectable region.
They pointed out that "it is clear that it is somewhat easier to test for the
global null hypothesis than to estimate the proportion," leaving the following
question unanswered: what is the precise region over which consistent
estimation of $\epsilon_n$ is possible?

There are two primary goals of the present paper. The first is to develop in
detail the theory for estimating $\epsilon_n$ in the two-point Gaussian mixture model.
The theory given in the present paper goes beyond consistent estimation,
and focuses on the development of procedures which have good mean squared
error performance. Minimax rates of convergence are shown to depend on the
magnitude of both $\mu_n$ and $\epsilon_n$; upper and lower bounds for the minimax mean
squared error are given, which differ only by logarithmic factors; estimates of
$\epsilon_n$ which adapt to the unknown $\mu_n$ and $\epsilon_n$ are also given. These results make
precise how accurately $\epsilon_n$ can be estimated in such a model. In particular,
we show that it is possible to estimate $\epsilon_n$ consistently whenever $(\epsilon_n, \mu_n)$ is
in the detectable region; and although the estimation problem is in some
sense technically more challenging than the testing problem, the estimable
region and detectable region actually coincide.

The other major goal of the present paper is to show that the theory
developed for the two-point mixture model leads to a one-sided confidence
interval for $\epsilon_n$, which has guaranteed coverage probability not only for the
two-point mixture model, but also over the mixture model (1.1) assuming
only that $H > 0$. In this general one-sided Gaussian mixture model, as noted
in a similar context by Meinshausen and Rice [14], the upper bound for $\epsilon_n$
must always be equal to 1: the possibility that $\epsilon_n = 1$ can never be ruled
out because the nonzero $\mu_i$ can be arbitrarily close to zero. For example,
asymptotically it is impossible to tell whether all the $\mu_i$ are zero or all of
them are equal to, say, $10^{-n}$. On the other hand, if many "large" values of
$X_i$ are observed it is possible to give useful lower bounds on the value of $\epsilon_n$.
This is therefore an example of a situation where only one-sided inference
is possible; a nontrivial lower bound for $\epsilon_n$ can be given but not a useful
upper bound. See Donoho [4] for other examples and a general discussion of
problems of one-sided inference. In such a setting, a natural goal is to provide 
a one-sided confidence interval for the parameter of interest, which both 
has a guaranteed coverage probability and is also "close" to the unknown 
parameter. We show that such a one-sided confidence interval can be built 
by using the theory developed for the two-point model. 

The paper is organized as follows. We start in Section 2 with the two-point
mixture model. As mentioned earlier, this model has been the focus
of recent attention both for testing the null hypothesis that $\epsilon_n = 0$ and for
consistent estimation of $\epsilon_n$. These results are briefly reviewed and then a
new family of estimators for $\epsilon_n$ is introduced. A detailed analysis of these
estimators requires precise bounds on the probability of overestimating $\epsilon_n$,
which can be given in terms of the probability that a particular confidence
band covers the true distribution function. Section 3 is devoted to giving
accurate upper bounds of this probability. In Section 4 we consider the
implication of these results for estimating $\epsilon_n$ under mean squared error.
Section 5 is devoted to the theory of one-sided confidence intervals over all
one-sided mixture models. Section 6 connects the results of the previous
sections to that of consistent estimation of $\epsilon_n$, where comparisons to the
work of Meinshausen and Rice [14] are also made. While the above theory is 
asymptotic, the discussion is continued in Section 7, where simulations show 
that the procedure performs well in settings similar to those considered by 
Meinshausen and Rice. Proofs are given in Section 8.

2. Estimation of $\epsilon_n$ in the two-point mixture model. In this section we
focus on estimating the fraction $\epsilon_n$ under the two-point mixture model,

(2.1) $X_i \stackrel{\text{i.i.d.}}{\sim} (1 - \epsilon_n) N(0,1) + \epsilon_n N(\mu_n, 1), \quad 1 \le i \le n.$

As mentioned in the Introduction, the problems of testing the null hypothesis
that $\epsilon_n = 0$ and estimating $\epsilon_n$ consistently in the sense that $P\{|\hat\epsilon_n/\epsilon_n - 1| >
\delta\} \to 0$ for all $\delta > 0$ have been considered. These results are briefly reviewed
in Section 2.1 so as to help clarify the goal of the present work. A new family 
of estimators is then introduced in Section 2.2. Later sections show how to 
select from this family of estimators those which have good mean squared 
error performance, and those which provide a lower end point for a one-sided 
confidence interval with a given guaranteed coverage probability. 

2.1. Review of testing and consistency results. Ingster [8] and Donoho
and Jin [5] studied the problem of testing the null hypothesis that $\epsilon_n = 0$.
It was shown that the interesting cases correspond to choices of $\epsilon_n$ and $\mu_n$
where $(\epsilon_n, \mu_n)$ are calibrated with a pair of parameters $(r, \beta)$: $\epsilon_n = n^{-\beta}$ and
$\mu_n = \sqrt{2r\log n}$, where $1/2 < \beta < 1$ and $0 < r < 1$. Under this calibration it
was shown that there is a detection boundary which separates the testing
problem into two regions. Set

(2.2) $\rho^*(\beta) = \begin{cases} \beta - 1/2, & 1/2 < \beta \le 3/4, \\ (1 - \sqrt{1-\beta})^2, & 3/4 < \beta < 1. \end{cases}$

In the $\beta$-$r$ plane, we call the curve $r = \rho^*(\beta)$ the detection boundary [5, 8, 9]
associated with this hypothesis testing problem. The detection boundary
separates the $\beta$-$r$ plane into two regions: the detectable region and the
undetectable region. When $(\beta, r)$ belongs to the interior of the undetectable
region, the sum of Type I and Type II errors for testing the null hypothesis
that $\epsilon_n = 0$ against the alternative $(\epsilon_n = n^{-\beta}, \mu_n = \sqrt{2r\log n})$ must tend
to 1. Hence no test can asymptotically distinguish the two hypotheses. On
the other hand when $(\beta, r)$ belongs to the interior of the detectable region,
there are tests for which both Type I and Type II errors tend to zero and
thus the hypotheses can be separated asymptotically. These two regions are
illustrated in Figure 2, where a third region, the classifiable region, is also
displayed. When $(\beta, r)$ belongs to the interior of the classifiable region, it
is not only possible to reliably tell that $\epsilon_n > 0$, but also to separate the
observations into signal and noise.

It should be stressed that this testing theory does not yield an effective
strategy for estimating $\epsilon_n$, though it does provide a benchmark for a theory
of consistent estimation. Important progress in this direction has recently
been made in Meinshausen and Rice [14], where an estimator of $\epsilon_n$ was
constructed and shown to be consistent if $r > 2\beta - 1$. This estimator is
however inconsistent when $r < 2\beta - 1$. Note here that the separating line
$r = 2\beta - 1$ always falls above the detection boundary. See Figure 2. The work
of Meinshausen and Rice leaves unclear the question of whether consistent
estimation of $\epsilon_n$ is possible over the entire detectable region. Of course, in
the undetectable region no estimator can be consistent, as any consistent
estimator immediately gives a reliable way for testing $\epsilon_n = 0$.
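
The two curves discussed in this section can be written down directly. The following is a minimal Python sketch of the detection boundary (2.2) and of the line $r = 2\beta - 1$; the function names `rho_star` and `region` and the region labels are illustrative assumptions, and points lying exactly on a curve are assigned arbitrarily since the theory concerns the interiors of the regions.

```python
import math

def rho_star(beta):
    """Detection boundary (2.2), defined for 1/2 < beta < 1."""
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie in (1/2, 1)")
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2

def region(beta, r):
    """Label (beta, r) relative to the two curves of Section 2.1."""
    if r < rho_star(beta):
        return "undetectable"
    if r > 2.0 * beta - 1.0:
        return "detectable; Meinshausen-Rice estimator consistent"
    return "detectable; below the line r = 2*beta - 1"
```

For instance, with $\beta = 0.6$ the boundary value is $\rho^*(0.6) = 0.1$ while the line gives $2\beta - 1 = 0.2$, so any $r$ strictly between 0.1 and 0.2 lies in the part of the detectable region left open by the consistency result quoted above.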

2.2. A family of estimators. The previous section outlined the theory
developed to date for estimating $\epsilon_n$ in the two-point Gaussian mixture model
(1.3). The goal of the present paper is to develop a much more precise
estimation theory both for one-sided confidence intervals and for mean
squared error. A large part of this theory relies on the construction of a fam- 
ily of easily implementable procedures along with an analysis of particular 
estimators chosen from this family of estimators. The present section focuses 
on providing a detailed description of the construction of this family of es- 
timators. Later in Sections 4 and 5 we will show how to choose particular 
members of this family to yield near optimal mean squared error estimates 
and one-sided confidence intervals. 

The basic idea underlying the general construction given here relies on
the following representation for $\epsilon_n$. Throughout the paper we shall denote
by $\phi$ and $\Phi$, respectively, the density and cumulative distribution function
(c.d.f.) of a standard normal distribution. Suppose that instead of observing
the data (2.1), one can observe directly the underlying c.d.f. $F(t) = (1 -
\epsilon_n)\Phi(t) + \epsilon_n\Phi(t - \mu_n)$ at just two points, say $\tau$ and $\tau'$ with $0 \le \tau < \tau'$. Then
the values of $\epsilon_n$ and $\mu_n$ can be determined precisely as follows. Set

(2.3) $D(\mu; \tau, \tau') = [\Phi(\tau) - \Phi(\tau - \mu)] / [\Phi(\tau') - \Phi(\tau' - \mu)].$

Lemma 8.1 in [2] shows that $D(\cdot\,; \tau, \tau')$ is strictly decreasing in $\mu > 0$ for any
$\tau < \tau'$. The parameters $\epsilon_n$ and $\mu_n$ are then uniquely determined by

(2.4) $\epsilon_n = \frac{\Phi(\tau) - F(\tau)}{\Phi(\tau) - \Phi(\tau - \mu_n)} \quad \text{and} \quad D(\mu_n; \tau, \tau') = \frac{\Phi(\tau) - F(\tau)}{\Phi(\tau') - F(\tau')}.$

It is easy to check that for $\tau < \tau'$,

$\inf_{\mu > 0} D(\mu; \tau, \tau') = \frac{\Phi(\tau)}{\Phi(\tau')} < \frac{\Phi(\tau) - F(\tau)}{\Phi(\tau') - F(\tau')} < \sup_{\mu > 0} D(\mu; \tau, \tau') = \frac{\phi(\tau)}{\phi(\tau')},$

so by the monotonicity of $D(\cdot\,; \tau, \tau')$, we can first solve for $\mu_n$ from the
right-hand side equation in (2.4), and then plug this $\mu_n$ into the left-hand side
equation in (2.4) for $\epsilon_n$.
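
The two-step inversion just described can be sketched numerically. The following minimal Python example assumes that the exact values $F(\tau)$ and $F(\tau')$ are available; the helper names (`gap`, `solve_mu`, `recover`), the use of bisection and the bracket $[10^{-8}, 40]$ are illustrative choices and are not part of the paper.

```python
import math

SQRT2 = math.sqrt(2.0)

def Phi(x):
    """Standard normal c.d.f."""
    return 0.5 * math.erfc(-x / SQRT2)

def gap(t, mu):
    """Phi(t) - Phi(t - mu), written with erfc to limit cancellation."""
    return 0.5 * (math.erfc((t - mu) / SQRT2) - math.erfc(t / SQRT2))

def D(mu, tau, tau2):
    """The ratio in (2.3)."""
    return gap(tau, mu) / gap(tau2, mu)

def solve_mu(target, tau, tau2, lo=1e-8, hi=40.0, iters=100):
    """Bisection for D(mu; tau, tau2) = target, using that D decreases in mu.

    Returns None when target is outside (D(hi), D(lo)); the bracket
    [lo, hi] is an arbitrary numerical choice."""
    if not D(hi, tau, tau2) < target < D(lo, tau, tau2):
        return None
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if D(mid, tau, tau2) > target:
            lo = mid   # D is still too large, so the root lies at larger mu
        else:
            hi = mid
    return 0.5 * (lo + hi)

def recover(F_tau, F_tau2, tau, tau2):
    """Recover (eps, mu) from exact values F(tau), F(tau2) via (2.4)."""
    num = Phi(tau) - F_tau
    mu = solve_mu(num / (Phi(tau2) - F_tau2), tau, tau2)
    return None if mu is None else (num / gap(tau, mu), mu)
```

Applied to the exact c.d.f. of a two-point mixture, `recover` returns the pair $(\epsilon_n, \mu_n)$ up to numerical error, which is the identifiability statement behind (2.4).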

In principle estimates of $\mu_n$ and $\epsilon_n$ can be given by replacing $F(\tau)$ and
$F(\tau')$ by their usual empirical estimates. Unfortunately, this simple approach
does not work well since the performance of the resulting estimate depends 



ESTIMATING SPARSE NORMAL MIXTURES 






critically on the choice of $\tau$ and $\tau'$. For most choices of $\tau$ and $\tau'$ the resulting
estimate is not a good estimate of $\epsilon_n$ in terms of mean squared error,
although it is often consistent. Moreover, although there are particular pairs
for which the resulting estimator does perform well, it is difficult to select
the optimal pair of $\tau$ and $\tau'$ since the optimal choice depends critically on
the unknown parameters $\epsilon_n$ and $\mu_n$. It is however worth noting that for
the situations considered here the optimal choices of $\tau$ and $\tau'$ always satisfy
$0 \le \tau < \tau' \le \sqrt{2\log n}$.

The key to the construction given below is that, instead of using the
usual empirical c.d.f. as estimates of $F(\tau)$ and $F(\tau')$, we use slightly biased
estimates of these quantities to yield an estimate of $\epsilon_n$ which is with high
probability smaller than the true $\epsilon_n$. It is in fact important to do this over
a large collection of $\tau$ and $\tau'$ so that the entire collection of estimates is
simultaneously smaller than $\epsilon_n$ with large probability. It then follows that
the maximum of these estimates is also smaller than $\epsilon_n$ with this same
high probability. This resulting estimate is just one member of our final
family of estimates; other members of this family are found by adjusting the
probability that the initial collection of estimators underestimates $\epsilon_n$. The
details of this construction are given below.

First note that underestimates of $\epsilon_n$ can be obtained by overestimating
$F(\tau)$ and underestimating $F(\tau')$. More specifically, suppose that $F^+(\tau) \ge
F(\tau)$ and $F^-(\tau') \le F(\tau')$. Then there are two cases depending on whether
or not the following holds:

$\frac{\Phi(\tau)}{\Phi(\tau')} \le \frac{\Phi(\tau) - F^+(\tau)}{\Phi(\tau') - F^-(\tau')} \le \frac{\phi(\tau)}{\phi(\tau')}.$

If it does not hold, then the equation does not give a good estimate for
$\mu_n$ and we take 0 to be an estimate for $\epsilon_n$. If it does hold, then we can
use (2.4) to estimate $\mu_n$ by simply replacing $F(\tau)$ and $F(\tau')$ by $F^+(\tau)$ and
$F^-(\tau')$, respectively. Call this estimate $\hat\mu_n$ and note that $\hat\mu_n \ge \mu_n$. It then
immediately follows that the solution to the first equation in (2.4) with $\hat\mu_n$
replacing $\mu_n$ (and $F^+(\tau)$ replacing $F(\tau)$) yields an estimate $\hat\epsilon_n$ of $\epsilon_n$ for which
$\hat\epsilon_n \le \epsilon_n$. A final estimator
is then created by taking the maximum of these estimators.

Of course in practice we do not create estimators which always overestimate
$F(\tau)$ and underestimate $F(\tau')$, as there is also another goal, namely
that these estimates are also close to $F(\tau)$ and $F(\tau')$. To reconcile these
goals it is convenient to first construct a confidence envelope for $F(t)$. First
fix a value $a_n$ and solve for $F(t)$: $\sqrt{n}\,\frac{|F_n(t) - F(t)|}{\sqrt{F(t)(1 - F(t))}} = a_n$, where $F_n$ is the
usual empirical c.d.f. The result is a pair of functions $F^{\pm}_{a_n}(t)$,

(2.5) $F^{\pm}_{a_n}(t) = \frac{2F_n(t) + a_n^2/n \pm \sqrt{a_n^2/n + (4F_n(t) - 4F_n^2(t))} \cdot (a_n/\sqrt{n})}{2(1 + a_n^2/n)}.$
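
Formula (2.5) is simply the pair of roots of a quadratic in $F(t)$, which can be checked numerically. In the minimal Python sketch below, the function names `envelope` and `standardized_gap` and the numerical inputs used in the example are illustrative assumptions.

```python
import math

def envelope(Fn_t, a_n, n):
    """Return (F_minus, F_plus) from (2.5) for an empirical c.d.f. value Fn_t."""
    c = a_n * a_n / n
    root = math.sqrt(c) * math.sqrt(c + 4.0 * Fn_t - 4.0 * Fn_t * Fn_t)
    denom = 2.0 * (1.0 + c)
    return (2.0 * Fn_t + c - root) / denom, (2.0 * Fn_t + c + root) / denom

def standardized_gap(Fn_t, F_t, n):
    """sqrt(n) |F_n(t) - F(t)| / sqrt(F(t)(1 - F(t)))."""
    return math.sqrt(n) * abs(Fn_t - F_t) / math.sqrt(F_t * (1.0 - F_t))

# Both envelope values sit exactly a_n standardized units from F_n(t).
F_minus, F_plus = envelope(0.9, 2.0, 1000)
```

Any candidate value of $F(t)$ strictly between `F_minus` and `F_plus` has a standardized gap below $a_n$, and any value outside has a larger one.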
Note $F^-_{a_n}(t) \le F(t) \le F^+_{a_n}(t)$ if and only if $\sqrt{n}\,\frac{|F_n(t) - F(t)|}{\sqrt{F(t)(1 - F(t))}} \le a_n$. So for any
$S_n \subseteq (-\infty, \infty)$, if we take $a_n$ to be the $\alpha$-upper percentile of
$\sup_{t \in S_n} \left\{ \sqrt{n}\,\frac{|F_n(t) - F(t)|}{\sqrt{F(t)(1 - F(t))}} \right\}$, then $F^{\pm}_{a_n}(t)$ together give a simultaneous
confidence envelope for $F(t)$ for all $t \in S_n$. For each $a_n$ the confidence envelope can
then be used to construct a collection of estimators as follows. Pick equally
spaced grid points over the interval $[0, \sqrt{2\log n}]$: $t_j = (j - 1)/\sqrt{2\log n}$, $1 \le
j \le 2\log(n) + 1$. For a pair of adjacent points $t_j$ and $t_{j+1}$ in the grid let
$\hat\mu^{(j)}_{a_n} = \hat\mu^{(j)}_{a_n}(t_j, t_{j+1}; n, \Phi, F^+, F^-)$ be the solution of the equation

(2.6) $D(\mu; t_j, t_{j+1}) = \frac{\Phi(t_j) - F^+_{a_n}(t_j)}{\Phi(t_{j+1}) - F^-_{a_n}(t_{j+1})}$

when such a solution exists. If there is no solution set $\hat\epsilon^{(j)}_{a_n} = 0$. Note that if
a solution exists and $F$ lies in the confidence envelope (2.5), then $F^+_{a_n}(t_j) \ge
F(t_j)$ and $F^-_{a_n}(t_{j+1}) \le F(t_{j+1})$ and hence $\hat\mu^{(j)}_{a_n} \ge \mu_n$. It then also follows that

(2.7) $\hat\epsilon^{(j)}_{a_n} = \frac{\Phi(t_j) - F^+_{a_n}(t_j)}{\Phi(t_j) - \Phi(t_j - \hat\mu^{(j)}_{a_n})}$

satisfies $\hat\epsilon^{(j)}_{a_n} \le \epsilon_n$. The final estimator $\hat\epsilon^*_{a_n}$ is defined by taking the maximum
of $\{\hat\epsilon^{(j)}_{a_n}\}$:

(2.8) $\hat\epsilon^*_{a_n} = \max_{1 \le j \le 2\log n} \hat\epsilon^{(j)}_{a_n}.$
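
The construction (2.5)-(2.8) can be assembled into a short program. The following is a minimal Python sketch under several assumptions that are not part of the paper: the function names, the bisection solver with bracket $[10^{-8}, 40]$, the rounding of $2\log n$ down to an integer, and the convention that a pair with a nonpositive numerator or denominator in (2.6) contributes the estimate 0. The argument `Fn` is any callable returning the (empirical) c.d.f., and `a_n` must be supplied by the user.

```python
import bisect
import math

SQRT2 = math.sqrt(2.0)

def Phi(x):
    return 0.5 * math.erfc(-x / SQRT2)

def gap(t, mu):
    # Phi(t) - Phi(t - mu), via erfc to limit cancellation
    return 0.5 * (math.erfc((t - mu) / SQRT2) - math.erfc(t / SQRT2))

def envelope(Fn_t, a_n, n):
    # (F_minus, F_plus) from (2.5)
    c = a_n * a_n / n
    root = math.sqrt(c) * math.sqrt(c + 4.0 * Fn_t * (1.0 - Fn_t))
    return ((2.0 * Fn_t + c - root) / (2.0 * (1.0 + c)),
            (2.0 * Fn_t + c + root) / (2.0 * (1.0 + c)))

def solve_mu(target, t, t2, lo=1e-8, hi=40.0, iters=100):
    # Bisection for (2.6); D decreases in mu. None means "no solution".
    D = lambda mu: gap(t, mu) / gap(t2, mu)
    if not D(hi) < target < D(lo):
        return None
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if D(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def empirical_cdf(xs):
    srt = sorted(xs)
    return lambda t: bisect.bisect_right(srt, t) / len(srt)

def estimate_eps(Fn, n, a_n):
    """Grid estimator (2.8): the maximum over j of the estimates (2.7)."""
    step = 1.0 / math.sqrt(2.0 * math.log(n))
    best = 0.0
    for j in range(1, int(2.0 * math.log(n)) + 1):
        t, t2 = (j - 1) * step, j * step
        F_plus = envelope(Fn(t), a_n, n)[1]
        F_minus = envelope(Fn(t2), a_n, n)[0]
        num, den = Phi(t) - F_plus, Phi(t2) - F_minus
        if num <= 0.0 or den <= 0.0:
            continue                 # (2.6) has no solution; estimate is 0
        mu_hat = solve_mu(num / den, t, t2)
        if mu_hat is not None:
            best = max(best, num / gap(t, mu_hat))   # (2.7)
    return best
```

With data `xs` one would call `estimate_eps(empirical_cdf(xs), len(xs), a_n)`. Passing the true mixture c.d.f. in place of the empirical one makes the envelope cover $F$ by construction, so the returned value cannot exceed the true $\epsilon_n$; this gives a deterministic check of the underestimation property described above.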

3. Evaluating the probability of underestimation. A family of estimators
depending on $a_n$ was introduced in Section 2 in terms of a confidence
envelope. A detailed analysis of these estimators depends critically on upper
bounding the probability of overestimating $\epsilon_n$. Note that $\hat\epsilon^*_{a_n}$ underestimates
$\epsilon_n$ whenever $F$ lies inside the confidence envelope given in (2.5); hence upper
bounds on the probability of overestimating $\epsilon_n$ can be given in terms of the
coverage probability of the confidence envelope. In this section, we collect a few results that
are useful throughout the remainder of this paper. Readers less interested 
in technical ideas may prefer to skip this section and to refer back to it as 
needed. 

A particularly easy way to analyze the confidence band given in (2.5) is
through the distribution $W_n^*$ given by

$W_n^* \stackrel{d}{=} \sup_t \left\{ \sqrt{n}\,\frac{|F_n(t) - F(t)|}{\sqrt{F(t)(1 - F(t))}} \right\},$

especially once we recall that the distribution of $W_n^*$ does not depend on
$F$. More specifically, consider $n$ independent samples $U_i$ from a uniform
distribution $U(0, 1)$. The empirical distribution corresponding to these
observations is then given by $V_n(t) = \frac{1}{n}\sum_{i=1}^{n} 1_{\{U_i \le t\}}$. Set $U_n(t) = \sqrt{n}[V_n(t) - t]$,
$0 < t < 1$, and write the normalized uniform empirical process as $W_n(t) =
\frac{|U_n(t)|}{\sqrt{t(1-t)}}$. The distribution of $W_n^*$ can then be written as $W_n^* \stackrel{d}{=} \sup_t W_n(t)$.

The following well-known result [15] can be used to construct asymptotic
fixed level one-sided confidence intervals for $\epsilon_n$:

(3.1) $\frac{W_n^*}{\sqrt{2\log\log n}} \stackrel{p}{\longrightarrow} 1 \quad \text{as } n \to \infty.$

Such an analysis underlies some of the theory in Meinshausen and Rice [14] 
but for the results given in our paper this approach does not suffice for 
reasons that we now explain. 

We are interested in estimators which underestimate $\epsilon_n$ with high probability.
These estimators correspond to choosing large $a_n$ and are used to
construct estimators with good mean squared error performance. Unfortunately
$W_n^*$ has an extremely heavy tail [5],

$\lim_{w \to \infty} w^2 P\{W_n^* \ge w\} = C,$

so using $W_n^*$ to bound such tail probabilities only yields bounds on the
chance that $\hat\epsilon^*_{a_n}$ exceeds $\epsilon_n$ which decrease slowly in $a_n$. Such bounds are
insufficient in our analysis of the mean squared error. The reason for this is
that the heavy-tailed behavior exhibited by $W_n^*$ is caused by the tails in the
empirical process and in our analysis we only consider values of $t$ between
0 and $\sqrt{2\log n}$. Hence instead of looking at $W_n^*$ we may instead analyze the
following modified version of $W_n^*$:

(3.2) $Y_n \stackrel{d}{=} \max_{\{0 \le t \le \sqrt{2\log n}\}} \left\{ \sqrt{n}\,\frac{|F_n(t) - F(t)|}{\sqrt{F(t)(1 - F(t))}} \right\},$

which can be equivalently written as $Y_n \stackrel{d}{=} \max_{\{F(0) \le t \le F(\sqrt{2\log n})\}} \left\{ \frac{|U_n(t)|}{\sqrt{t(1-t)}} \right\}$.



The problem here is that $F(0)$ and $F(\sqrt{2\log n})$ are unknown and depend
on $F$, so we need a different way to estimate the tail probability of $Y_n$. We
suggest two possible approaches. The first one is clean but conservative and
is particularly valuable for theoretical development. The second one has a
more complicated form but is sharp and allows for greater precision in the
construction of confidence intervals. In the first approach, write $W_n^+$ for the
distribution of $Y_n$ where $F$ corresponds to $N(0, 1)$ and $F_n$ is the empirical
c.d.f. formed from $n$ i.i.d. $N(0, 1)$ observations. Then $W_n^+$ can be written as

$W_n^+ \stackrel{d}{=} \max_{\{1/2 \le t \le \Phi(\sqrt{2\log n})\}} \left\{ \frac{|U_n(t)|}{\sqrt{t(1-t)}} \right\}.$

The following lemma shows the tail probability of any $Y_n$ associated with
an $F$ is at most twice as large as that of $W_n^+$, uniformly for all Gaussian
mixtures $F$ of the form $F(t) = \int \Phi(t - \mu)\,dH$ with $P\{0 < H \le \sqrt{2\log n}\} = 1$.



Lemma 3.1. Suppose that $Y_n$ is the distribution given in (3.2) where $F$
is a Gaussian mixture $F(t) = \int \Phi(t - \mu)\,dH$ with $P\{0 < H \le \sqrt{2\log n}\} = 1$.
Then for any constant $c$, $P\{Y_n \ge c\} \le 2 \cdot P\{W_n^+ \ge c\}$.

The following tail bound for $W_n^+$ can be used to bound $P(\hat\epsilon^*_{a_n} > \epsilon_n)$.

Lemma 3.2. For any constant $c_0 > 0$, for sufficiently large $n$, there is a
constant $C > 0$ such that $P\{W_n^+ \ge c_0 \log^{3/2}(n)\} \le C \cdot n^{-1.5 c_0/\sqrt{8\pi}}$.

It should now be clear why in our setting it is preferable to use such bounds
since the corresponding tail behavior of $W_n^*$ satisfies $P\{W_n^* \ge c_0 \log^{3/2}(n)\} \asymp
C \times (\log n)^{-3}$, which is not sufficient for our analysis of mean squared error
given in the next section.

In the second approach, note that $F(\sqrt{2\log n}) \le \Phi(\sqrt{2\log n})$, and with
overwhelming probability, $F(0) \ge F_n(0) - \sqrt{c_0 \log(n)}/\sqrt{n}$. Now, for any
constant $c_0 > 0$, define

$W_n^{++} = W_n^{++}(c_0) = \max_{\{(F_n(0) - \sqrt{c_0 \log(n)}/\sqrt{n}) \le t \le \Phi(\sqrt{2\log n})\}} \left\{ \frac{|U_n(t)|}{\sqrt{t(1-t)}} \right\}.$

The following lemma shows that the tail probability of $Y_n$ is almost bounded
by that of $W_n^{++}$, uniformly for all one-sided mixtures even without the
constraint that $H \le \sqrt{2\log n}$.

Lemma 3.3. Suppose that $Y_n$ is the distribution given in (3.2) where $F$
is a Gaussian mixture $F(t) = \int \Phi(t - \mu)\,dH$ with $P\{H > 0\} = 1$. Then for
any constant $c_0 > 0$ and $c$, $P\{Y_n \ge c\} \le P\{W_n^{++} \ge c\} + 2n^{-c_0} \cdot (1 + o(1))$.

This lemma is particularly useful in the construction of accurate confidence
intervals where we take $c_0 = 3$ so that the difference between the two
probabilities is $O(n^{-3})$. Without further notice, $W_n^{++}$ refers to the one
with $c_0 = 3$. Lemmas 3.1-3.3 are proved in [2], Sections 8.2-8.4.

3.1. Choice of $a_n$ in later sections. Different choices of $a_n$ lead to different
estimators of $\epsilon_n$. We shall choose $a_n$ depending on the purpose. In
Section 4 the focus is on optimal rates of convergence for mean squared
error. For this purpose it is convenient to choose a relatively large $a_n$ [i.e.,
$4\sqrt{2\pi}\log^{3/2}(n)$]. In Section 6, where the focus is on consistency, a much
smaller $a_n$ is also sufficient and might be preferred. Finally, the interest of
Section 5 is on one-sided confidence intervals, and here we wish to choose
an $a_n$ with level $\alpha = P\{Y_n \ge a_n\}$ being fixed. The difficulty here is that,
unlike in the above two cases, the $a_n$ depends on the unknown $F(0)$
and $F(\sqrt{2\log n})$. Fortunately, the level $\alpha$ is fixed and specified beforehand,
so one can use simulated values of $W_n^{++}$ to approximate $a_n$ without much
computational complexity.
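
The simulation mentioned above can be sketched as follows. This is a minimal Python illustration and not the authors' implementation: the function names, the number of replications and the empirical-quantile rule are assumptions, and `Fn0` denotes the value $F_n(0)$ computed from the observed data. The supremum is evaluated exactly by using the fact that, for a fixed value of $V_n$, the ratio $|V_n - t|/\sqrt{t(1-t)}$ is monotone in $t$ on each interval between jumps.

```python
import bisect
import math
import random

def sup_normalized(us_sorted, lo, hi):
    """max over lo <= t <= hi of sqrt(n)|V_n(t) - t| / sqrt(t(1 - t)).

    V_n is the empirical c.d.f. of the sorted uniforms. It suffices to
    check the two endpoints and the left and right limits at each jump."""
    if not 0.0 < lo < hi < 1.0:
        raise ValueError("need 0 < lo < hi < 1")
    n = len(us_sorted)
    def stat(v, t):
        return math.sqrt(n) * abs(v - t) / math.sqrt(t * (1.0 - t))
    i0 = bisect.bisect_right(us_sorted, lo)   # number of points <= lo
    i1 = bisect.bisect_right(us_sorted, hi)   # number of points <= hi
    best = max(stat(i0 / n, lo), stat(i1 / n, hi))
    for k in range(i0, i1):                   # jump points in (lo, hi]
        u = us_sorted[k]
        best = max(best, stat(k / n, u), stat((k + 1) / n, u))
    return best

def simulate_a_n(n, Fn0, alpha, c0=3.0, reps=200, seed=0):
    """Monte Carlo upper alpha-percentile of W_n^{++}(c0)."""
    rng = random.Random(seed)
    lo = Fn0 - math.sqrt(c0 * math.log(n)) / math.sqrt(n)
    hi = 0.5 * math.erfc(-math.sqrt(math.log(n)))   # Phi(sqrt(2 log n))
    draws = sorted(
        sup_normalized(sorted(rng.random() for _ in range(n)), lo, hi)
        for _ in range(reps))
    return draws[min(reps - 1, int((1.0 - alpha) * reps))]
```

Since only uniform samples are needed, the cost is that of sorting `reps` samples of size $n$; the accuracy of the resulting percentile is limited by the number of replications.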

4. Mean squared error. In this section, we focus on choosing a member 
of the family of estimators constructed in Section 2.2 which has near optimal 
mean squared error properties. More discussion is given in Section 7 where 
a simulation study provides further insight into the mean squared error 
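performance of these estimators. Our analysis begins with the bound

$E\left[\frac{\hat\epsilon^*_{a_n}}{\epsilon_n} - 1\right]^2 \le \epsilon_n^{-2}\, P(\hat\epsilon^*_{a_n} > \epsilon_n) + E\left[\left(\frac{\hat\epsilon^*_{a_n}}{\epsilon_n} - 1\right)^2 \cdot 1_{\{\epsilon_n \ge \hat\epsilon^*_{a_n}\}}\right].$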

There is a tradeoff depending on the choice of $a_n$. As $a_n$ increases $P(\hat\epsilon^*_{a_n} >
\epsilon_n)$ decreases but when $\hat\epsilon^*_{a_n}$ underestimates $\epsilon_n$ it does so by a greater amount.
It is thus desirable to choose the smallest $a_n$ so that the first term is negligible
and this in fact leads to an estimator with near optimal performance. It
should be stressed that in the construction of the smallest such $a_n$ the precise
bounds given in Lemma 3.2 are important and the tail bounds for $W_n^*$ do
not suffice. In particular Lemma 3.2 shows that $a_n = 4\sqrt{2\pi}\log^{3/2}(n)$ suffices
to make this first term negligible. For such a choice, the following theorem
gives upper bounds on the minimax risk.

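Theorem 4.1. Suppose $F(t) = (1 - \epsilon_n)\Phi(t) + \epsilon_n\Phi(t - \mu_n)$ with $\epsilon_n =
n^{-\beta}$, $\mu_n = \sqrt{2r\log n}$, where $0 < r < 1$, $\frac{1}{2} < \beta < 1$, and $r > \rho^*(\beta)$ so that $(\beta, r)$
falls into the interior of the detectable region. Set $a_n = 4\sqrt{2\pi}\log^{3/2}(n)$.
The estimator $\hat\epsilon^*_{a_n}$ defined in (2.8) satisfies

(4.1) $E\left[\frac{\hat\epsilon^*_{a_n}}{\epsilon_n} - 1\right]^2 \le \begin{cases} C(\beta, r)(\log n)^{5.5}\, n^{-1-2r+2\beta}, & \text{when } \beta \ge 3r, \\ C(\beta, r)(\log n)^{5.5}\, n^{-1+(\beta+r)^2/(4r)}, & \text{when } r < \beta < 3r, \\ C(\beta, r)(\log n)^{4}\, n^{-1+\beta}, & \text{when } \beta \le r, \end{cases}$

where $C(\beta, r)$ is a generic constant depending on $(\beta, r)$.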



Theorem 4.1 gives an upper bound for the rate of convergence of $\hat\epsilon^*_{a_n}$.
Although this estimator usually underestimates $\epsilon_n$, the lower bounds for the
mean squared error given below show that the performance of the estimator
cannot be significantly improved.

Although the lower bounds given below are based on a two-point testing
argument we should stress that they do not follow from the testing theory
developed in [8]. In particular the detection boundary mentioned in Section
1 is derived by testing the simple hypothesis that $\epsilon_n = 0$ against a particular
alternative hypothesis. Here we need to study a more complicated hypothesis
testing problem where both the null and alternative hypotheses correspond
to Gaussian mixtures. More specifically, let $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} P$ and consider
the following problem of testing between the two Gaussian mixtures:

$H_0: P = P_0 = (1 - \epsilon_{0,n}) N(0, 1) + \epsilon_{0,n} N(\mu_{0,n}, 1)$

and

$H_1: P = P_1 = (1 - \epsilon_{1,n}) N(0, 1) + \epsilon_{1,n} N(\mu_{1,n}, 1).$

Minimax lower bounds for estimating $\epsilon_n$ can then be given based on carefully
selected values of $\epsilon_{0,n}$, $\epsilon_{1,n}$, $\mu_{0,n}$ and $\mu_{1,n}$ along with good bounds on
the Hellinger affinity between $n$ i.i.d. observations with distributions $P_0$ and
$P_1$. As is shown in the proof of the following theorem, these bounds require
somewhat delicate arguments. We should mention that our attempts using
bounds on the chi-square distance, a common approach to such problems,
did not yield the present results. The lower bounds are summarized as
follows.

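Theorem 4.2. Let $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} (1 - \epsilon_n) N(0, 1) + \epsilon_n N(\mu_n, 1)$. For $0 <
r < 1$, $\frac{1}{2} < \beta < 1$, $a_1, a_2 > 0$ and $b_2 > b_1 > 0$, set $\Omega_n = \{(\epsilon_n, \mu_n) : b_1 n^{-\beta} \le \epsilon_n \le
b_2 n^{-\beta},\ \sqrt{2r\log n} - \frac{a_1}{\sqrt{\log n}} \le \mu_n \le \sqrt{2r\log n} + \frac{a_2}{\sqrt{\log n}}\}$. Then

$\inf_{\hat\epsilon_n} \sup_{(\epsilon_n, \mu_n) \in \Omega_n} E\left(\frac{\hat\epsilon_n}{\epsilon_n} - 1\right)^2 \ge \begin{cases} C(\log n)\, n^{-1-2r+2\beta}, & \text{when } \beta \ge 3r, \\ C(\log n)^{5/2}\, n^{-1+(\beta+r)^2/(4r)}, & \text{when } r < \beta < 3r, \\ C n^{-1+\beta}, & \text{when } \beta \le r. \end{cases}$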

A comparison between the upper bounds given in Theorem 4.1 and the
lower bounds given in Theorem 4.2 shows that the procedure $\hat\epsilon^*_{a_n}$ has mean
squared error within a logarithmic factor of the minimax risk. Additional
insight into the performance of this estimator is given in Section 6 where
comparisons to an estimator introduced by Meinshausen and Rice [14] are
made and in Section 7 where we report some simulation results.
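
To make this comparison concrete, the three regimes in Theorems 4.1 and 4.2 can be tabulated. The following minimal Python sketch returns the common power of $n$ and the two powers of $\log n$; the function names are illustrative assumptions, and the formulas are meaningful only for $(\beta, r)$ in the interior of the detectable region.

```python
def rate_exponent(beta, r):
    """Power of n shared by the bounds in Theorems 4.1 and 4.2."""
    if beta >= 3.0 * r:
        return -1.0 - 2.0 * r + 2.0 * beta
    if beta > r:
        return -1.0 + (beta + r) ** 2 / (4.0 * r)
    return -1.0 + beta

def log_powers(beta, r):
    """(upper, lower) powers of log n from (4.1) and Theorem 4.2."""
    if beta >= 3.0 * r:
        return 5.5, 1.0
    if beta > r:
        return 5.5, 2.5
    return 4.0, 0.0
```

For example, at $(\beta, r) = (0.6, 0.7)$ both bounds are of order $n^{-0.4}$ and differ by the factor $(\log n)^4$. The three expressions for the power of $n$ agree at the regime boundaries $\beta = 3r$ and $\beta = r$.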

5. One-sided confidence intervals. In the previous section we showed
how to choose $a_n$ so that the estimator $\hat\epsilon^*_{a_n}$ has good mean squared error
properties. In the present section we consider in more detail one-sided
confidence intervals. For such intervals there are two conflicting goals. We want
to maintain coverage probability over a large class of models while minimizing
the amount that our estimator underestimates $\epsilon_n$. More specifically, the
goal can be formulated in terms of the following optimization problem:

Minimize $E(\epsilon_n - \hat\epsilon_n)_+$ subject to $\sup_{\mathcal{F}} P(\hat\epsilon_n > \epsilon_n) \le \alpha$,

where $\mathcal{F}$ is a collection of Gaussian mixtures. A similar formulation for the
construction of optimal nonparametric confidence intervals is given in Cai
and Low [3].

In the present section we focus on this optimization problem for the class
of all two-point Gaussian mixtures showing that the estimator $\hat\epsilon^*_{a_n}$ with an
appropriately chosen $a_n$ provides an almost optimal lower end point for
a one-sided confidence interval with a given coverage probability. Perhaps
equally interesting is that this one-sided confidence interval maintains
coverage probability over a much larger collection of Gaussian mixture models,
namely the set of all one-sided Gaussian mixtures with $H > 0$. See also
Section 6.3 where we briefly discuss how the condition $H > 0$ can be dropped.

5.1. Coverage over one-sided Gaussian mixtures. In this section we show
how one-sided confidence intervals with a given coverage probability can be
constructed for the collection of all one-sided Gaussian mixtures (1.1) with
$H > 0$. Let $\mathcal{F}$ be the collection of all one-sided Gaussian mixture c.d.f.s of the
form $(1-\varepsilon)\Phi(t) + \varepsilon G$, where $G(t) = \int \Phi(t-\mu)\,dH(\mu)$ is the convolution of $\Phi$ and
a c.d.f. $H$ supported on the positive half-line. For arbitrary constants $0 <
a < b < 1$ and $0 < \tau < \tau'$, out of all c.d.f.s $F \in \mathcal{F}$ passing through points $(\tau, a)$
and $(\tau', b)$, the most "sparse" one (i.e., smallest $\varepsilon$) is a two-point Gaussian
mixture $F^*(t) = (1-\varepsilon^*)\Phi(t) + \varepsilon^*\Phi(t-\mu^*)$, where $(\varepsilon^*,\mu^*)$ are chosen such
that $F^*(\tau) = a$ and $F^*(\tau') = b$. That is,

$$\mu^* : \text{solution of } D(\mu;\tau,\tau') = \frac{\Phi(\tau) - a}{\Phi(\tau') - b} \quad\text{and}\quad \varepsilon^* = \frac{\Phi(\tau) - a}{\Phi(\tau) - \Phi(\tau - \mu^*)}, \tag{5.1}$$

where the function $D$ is given in (2.3). The following lemma is proved in [2],
Section 8.7.

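Lemma 5.1. Fix $0 < a < b < 1$, $0 < \tau < \tau'$, and $0 < \varepsilon < 1$. For any $F =
(1-\varepsilon)\Phi(t) + \varepsilon G \in \mathcal{F}$ such that $F(\tau) = a$ and $F(\tau') = b$, define $\varepsilon^*$ by (5.1).
Then $\varepsilon^* \le \varepsilon$.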

See Figure 1. 
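The construction in (5.1) can be illustrated numerically. The sketch below assumes that $D(\mu;\tau,\tau') = [\Phi(\tau)-\Phi(\tau-\mu)]/[\Phi(\tau')-\Phi(\tau'-\mu)]$, which is what (5.1) requires for a two-point mixture but is defined in (2.3), outside this section, and that $D$ is strictly decreasing in $\mu$ so that bisection applies. The function names and the bracketing interval are illustrative choices, not part of the paper.

```python
import math

def Phi(x):
    """Standard normal c.d.f."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def D(mu, tau, tau2):
    """Assumed form of D(mu; tau, tau') for the Gaussian family."""
    return (Phi(tau) - Phi(tau - mu)) / (Phi(tau2) - Phi(tau2 - mu))

def sparsest_two_point(tau, a, tau2, b, mu_hi=20.0, iters=200):
    """Solve (5.1) for (eps*, mu*) by bisection, assuming D decreases in mu."""
    target = (Phi(tau) - a) / (Phi(tau2) - b)
    lo, hi = 1e-8, mu_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if D(mid, tau, tau2) > target:
            lo = mid      # D is still too large, so mu* lies to the right
        else:
            hi = mid
    mu_star = 0.5 * (lo + hi)
    eps_star = (Phi(tau) - a) / (Phi(tau) - Phi(tau - mu_star))
    return eps_star, mu_star

# Sanity check: start from a known two-point mixture and recover it.
eps0, mu0, tau, tau2 = 0.05, 2.0, 1.0, 2.5
a = (1 - eps0) * Phi(tau) + eps0 * Phi(tau - mu0)
b = (1 - eps0) * Phi(tau2) + eps0 * Phi(tau2 - mu0)
eps_star, mu_star = sparsest_two_point(tau, a, tau2, b)
```

When the c.d.f. passing through the two points is itself a two-point mixture, the recovered pair coincides with the generating parameters; for any other one-sided mixture through the same points, Lemma 5.1 says the recovered $\varepsilon^*$ is no larger than the true $\varepsilon$.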

We now turn to the coverage probability of the grid procedure $\hat\varepsilon^*_{a_n}$ over
the class $\mathcal{F}$. Fix an $F \in \mathcal{F}$. Then for each pair of adjacent points $(t_j, t_{j+1})$
in the grid, the above lemma shows that there is a two-point Gaussian
mixture $F^*(t) = (1-\varepsilon^*_j)\Phi(t) + \varepsilon^*_j\Phi(t-\mu^*_j)$, where $(\varepsilon^*_j,\mu^*_j)$ are chosen such
that $F^*(t_j) = F(t_j)$ and $F^*(t_{j+1}) = F(t_{j+1})$. It is clear that $\varepsilon^*_j$ depends on
the points $t_j$ and $t_{j+1}$, but Lemma 5.1 shows that in each case $\varepsilon^*_j \le \varepsilon_n$. Now
suppose that $F$ lies inside the confidence envelope defined by (2.5). In this
case it follows that $\hat\varepsilon^{(j)}_{a_n}$ defined by (2.7) satisfies $\hat\varepsilon^{(j)}_{a_n} \le \varepsilon^*_j$ and hence also
$\hat\varepsilon^{(j)}_{a_n} \le \varepsilon_n$. Since this holds for all $j$, it then immediately follows that $\hat\varepsilon^*_{a_n} \le \varepsilon_n$
whenever $F$ lies inside the confidence interval defined by (2.5). A given level
confidence interval can then be given based on the distributions of $W_n^+$ and
$W_n^{++}$. This result is summarized in the following theorem.

Theorem 5.1. Fix $0 < \alpha < 1$ and let $a_n$ be chosen so that $P(W_n^+ >
a_n) \le \alpha/2$. Then uniformly for all $n$ and all one-sided Gaussian location
mixtures defined in (1.2) with $P(0 < H \le \sqrt{2\log n}) = 1$, $P\{\hat\varepsilon^*_{a_n} \le \varepsilon_n\} \ge 1-\alpha$.
Moreover, let $a_n$ be chosen so that $P(W_n^{++} > a_n) \le \alpha$. Then as $n \to \infty$,
uniformly for all one-sided Gaussian location mixtures defined in (1.2) with
$P\{H > 0\} = 1$, $P\{\hat\varepsilon^*_{a_n} \le \varepsilon_n\} \ge (1-\alpha)(1+o(1))$.

Fig. 1. In the c.d.f. plane, among the family of all one-sided Gaussian location mixtures
which pass through two given points $(\tau, a)$ and $(\tau', b)$, the most sparse mixture is a two-point
mixture (the solid curve) which bounds all other c.d.f.s from above over the whole interval

5.2. Optimality under two-point Gaussian mixture model. In the previous
section we focused on the coverage property of the one-sided confidence
interval over the general class of one-sided Gaussian mixtures. In this section
we return to the class of two-point Gaussian mixtures and study how "close"
the lower confidence limit $\hat\varepsilon_n$ is to the true but unknown $\varepsilon_n$. In particular we
compare the performance of our procedure with the following lower bound.

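Theorem 5.2. Let $X_1,\ldots,X_n \stackrel{\text{i.i.d.}}{\sim} (1-\varepsilon_n)N(0,1) + \varepsilon_n N(\mu_n,1)$. For $0 <
r < 1$, $\frac12 < \beta < 1$, $a_1, a_2 > 0$ and $b_2 > b_1 > 0$, set

$$\Omega_n = \Big\{(\varepsilon_n,\mu_n) : b_1 n^{-\beta} \le \varepsilon_n \le b_2 n^{-\beta},\ \sqrt{2r\log n} - \frac{a_1}{\sqrt{\log n}} \le \mu_n \le \sqrt{2r\log n} + \frac{a_2}{\sqrt{\log n}}\Big\}.$$

For $0 < \alpha < \frac12$, let $\hat\varepsilon_n$ be a $(1-\alpha)$ level lower confidence limit for $\varepsilon_n$ over $\Omega_n$,
namely, $\inf_{\Omega_n} P\{\varepsilon_n \ge \hat\varepsilon_n\} \ge 1-\alpha$. Then

$$\inf_{\hat\varepsilon_n}\ \sup_{\Omega_n} E\Big(1 - \frac{\hat\varepsilon_n}{\varepsilon_n}\Big)_+ \ge
\begin{cases}
C(\log n)^{1/2}\, n^{-1/2-r+\beta}, & \text{when } \beta \ge 3r,\\
C(\log n)^{5/4}\, n^{-1/2+(\beta+r)^2/(8r)}, & \text{when } r < \beta < 3r,\\
C\, n^{-1/2+\beta/2}, & \text{when } \beta \le r.
\end{cases}$$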

Theorem 5.2 shows that even if the goal is to create an honest confidence
interval over the class of two-point Gaussian mixture models the resulting
estimator must underestimate the true $\varepsilon_n$ by a given amount. The following
theorem shows that the estimator given in the previous section, which has
guaranteed coverage over the class of all one-sided Gaussian mixture models,
is almost optimal for two-point Gaussian mixtures.

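Theorem 5.3. Suppose $F$ is a two-point mixture $F(t) = (1-\varepsilon_n)\Phi(t) +
\varepsilon_n\Phi(t-\mu_n)$ with $\varepsilon_n = n^{-\beta}$, $\mu_n = \sqrt{2r\log n}$, where $0 < r < 1$, $\frac12 < \beta < 1$ and
$r > \rho^*(\beta)$ so $(\beta, r)$ falls into the interior part of the detectable region. Fix
$0 < \alpha < 1$ and let $a_n$ be chosen so that either $P\{W_n^+ > a_n\} \le \alpha/2$ or such that
$P\{W_n^{++} > a_n\} \le \alpha$, and for this value of $a_n$ let $\hat\varepsilon^*_{a_n}$ be the estimator defined
in (2.8). Then there is a constant $C = C(\beta,r) > 0$ such that

$$E\Big(1 - \frac{\hat\varepsilon^*_{a_n}}{\varepsilon_n}\Big)_+ \le
\begin{cases}
C\cdot\sqrt{\log\log(n)}\cdot(\log n)^{5/4}\cdot n^{-1/2-r+\beta}, & \text{when } \beta \ge 3r,\\
C\cdot\sqrt{\log\log(n)}\cdot(\log n)^{5/4}\cdot n^{-1/2+(\beta+r)^2/(8r)}, & \text{when } r < \beta < 3r,\\
C\cdot\sqrt{\log\log(n)}\cdot n^{-1/2+\beta/2}, & \text{when } \beta \le r.
\end{cases}$$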

6. Discussion. In this section we compare and contrast the methodology 
developed in the present paper to the approach taken by Meinshausen and 
Rice [14]. The goal is to explain intuitively some of the theory developed in 
these two papers. Both methods have a root based on the idea of "threshold- 
ing," and how well each method works can partially be explained in terms 
of the concept of most informative threshold. 
We shall start with a general comparison of the two estimators. It is useful
to note that the stochastic fluctuations of these estimators are not larger
in order of magnitude than the bias. It is thus instructive for a heuristic
analysis to replace each of these estimators by nonrandom approximations.
The approach taken in Meinshausen and Rice [14] starts with a more general
mixture model which after a transformation can be written as

$$Y_i \stackrel{\text{i.i.d.}}{\sim} (1-\varepsilon_n)N(0,1) + \varepsilon_n F, \qquad 1 \le i \le n,$$

where $F$ is an arbitrary distribution. In that context one-sided bounds are
given for $\varepsilon_n$ which hold no matter what the distribution $F$ is. The lower bound
can be thought of as first picking an arbitrary threshold $t$, then comparing the
fraction of samples $> t$ with the expected fraction $> t$ when all samples are
truly from $N(0,1)$; the difference between the two fractions either comes from
stochastic fluctuations or from the signal, which thus naturally provides a
lower bound if the stochastic fluctuations are controlled.

Using our notation, Meinshausen and Rice's lower bound can be written
as $\hat\varepsilon^{MR}_{a_n^*} = \sup_{-\infty<t<\infty} \hat\varepsilon^{MR}_{a_n^*}(t;F_n)$, where

$$\hat\varepsilon^{MR}_{a_n^*}(t;F_n) = \frac{\Phi(t) - F_n(t) - (a_n^*/\sqrt{n})\cdot\sqrt{\Phi(t)(1-\Phi(t))}}{\Phi(t)}. \tag{6.1}$$

Here $a_n^* > 0$ is a constant which plays a similar role as $a_n$ in our estimator,
and without loss of generality, we chose $1/\sqrt{t(1-t)}$ as the bounding
function [14]. A useful approximation to this estimator is given by neglecting
the stochastic fluctuation where we replace $F_n$ by $F$. The result is the
approximation $\tilde\varepsilon^{MR}_{a_n^*}(t;F)$,

$$\hat\varepsilon^{MR}_{a_n^*}(t;F_n) \approx \tilde\varepsilon^{MR}_{a_n^*}(t;F) = \frac{\Phi(t) - F(t) - (a_n^*/\sqrt{n})\cdot\sqrt{\Phi(t)(1-\Phi(t))}}{\Phi(t)}. \tag{6.2}$$

It is instructive to compare this approximation with the following slightly
modified version of our estimator where we neglect the stochastic difference
by replacing $\hat\mu_j$ by $\mu_n$ and where we approximate $F^+_{a_n}$ by $F + \frac{a_n}{\sqrt{n}}\sqrt{F(1-F)}$.
Then the estimator $\hat\varepsilon^*_{a_n}$ can be approximated by $\hat\varepsilon^*_{a_n} \approx \sup_{0<t\le\sqrt{2\log n}} \tilde\varepsilon_{a_n}(t;F)$,
where

$$\tilde\varepsilon_{a_n}(t;F) = \frac{\Phi(t) - F(t) - (a_n/\sqrt{n})\cdot\sqrt{F(t)(1-F(t))}}{\Phi(t) - \Phi(t-\mu_n)}. \tag{6.3}$$

It is now easy to compare (6.2) with (6.3). There are three differences:
(i) we use $\Phi(t) - \Phi(t-\mu_n)$ as the denominator instead of $\Phi(t)$; (ii) we
use $\sqrt{F(t)(1-F(t))}$ rather than $\sqrt{\Phi(t)(1-\Phi(t))}$ for controlling stochastic
fluctuation; (iii) we take the maximum over $(0, \sqrt{2\log n})$ instead of $(-\infty, \infty)$.
In fact, only the first difference is important in the analysis of the two-point
mixture model.
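The two nonrandom approximations (6.2) and (6.3) are easy to evaluate on a grid of thresholds. The sketch below is an illustration under stated assumptions: it uses the simulation setting of Section 7 ($n = 10^7$, $\varepsilon_n = 10^{-4}$, $\mu_n = \sqrt{\log n}$), takes the constants $a_n$ and $a_n^*$ from the $\alpha_n = 0.05$ column of Table 2, and restricts both suprema to the grid $[0, \sqrt{2\log n}]$ with an arbitrary step of 0.01. The function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def Phi(x):
    """Standard normal c.d.f."""
    return 0.5 * math.erfc(-x / SQRT2)

def Q(x):
    """Standard normal upper tail 1 - Phi(x), accurate for large x."""
    return 0.5 * math.erfc(x / SQRT2)

def best_thresholds(n, eps, mu, a_new, a_mr, step=0.01):
    """Maximize (6.3) and (6.2) over t in [0, sqrt(2 log n)]; return (value, t) pairs."""
    t_max = math.sqrt(2.0 * math.log(n))
    root_n = math.sqrt(n)
    best_new = (-math.inf, 0.0)
    best_mr = (-math.inf, 0.0)
    for k in range(int(t_max / step) + 1):
        t = k * step
        spread = Phi(t) - Phi(t - mu)                 # denominator in (6.3)
        gap = eps * spread                            # Phi(t) - F(t) for the two-point mixture
        upper_F = (1 - eps) * Q(t) + eps * Q(t - mu)  # 1 - F(t)
        F = 1.0 - upper_F
        new = (gap - (a_new / root_n) * math.sqrt(F * upper_F)) / spread
        mr = (gap - (a_mr / root_n) * math.sqrt(Phi(t) * Q(t))) / Phi(t)
        best_new = max(best_new, (new, t))
        best_mr = max(best_mr, (mr, t))
    return best_new, best_mr

n, eps = 10**7, 1e-4
mu = math.sqrt(2 * 0.5 * math.log(n))
scale = math.sqrt(2.0 * math.log(math.log(n)))
(e_new, t_new), (e_mr, t_mr) = best_thresholds(n, eps, mu, 1.545 * scale, 1.826 * scale)
```

In this setting the maximizer of (6.3) sits above $\mu_n$ while the maximizer of (6.2) sits below it, and the approximation (6.3) comes closer to the true $\varepsilon_n$. These are values of the deterministic approximations only, so they are not expected to match the simulated means in Table 2.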
6.1. Consistent estimation. In this section we compare the approximations
for the two-point mixture models starting with the Meinshausen and
Rice procedure [14]. We have

$$1 - \tilde\varepsilon^{MR}_{a_n^*}(t,F)/\varepsilon_n = \frac{\Phi(t-\mu_n)}{\Phi(t)} + a_n^*\, n^{\beta-1/2}\cdot\sqrt{(1-\Phi(t))/\Phi(t)}, \tag{6.4}$$

and in order for $\hat\varepsilon^{MR}_{a_n^*}$ to be consistent, we need a $t$ such that

$$\frac{\Phi(t-\mu_n)}{\Phi(t)} \to 0 \quad\text{and}\quad a_n^*\, n^{\beta-1/2}\cdot\sqrt{(1-\Phi(t))/\Phi(t)} \to 0. \tag{6.5}$$

It is easy to check that both these conditions hold only if $\sqrt{2(2\beta-1)\log n} <
t < \mu_n$ and that this is only possible when $r > 2\beta-1$. Hence the Meinshausen
and Rice procedure is only consistent on a subset of the detectable region.
Note here that consistency requires a constraint on $t$, namely that $t$ should
not exceed $\mu_n$ regardless of the value of $\beta$.

A similar analysis can be provided for the approximation of our estimator.
Since we use the term $\Phi(t) - \Phi(t-\mu_n)$ as the denominator in (6.3)
instead of $\Phi(t)$, the above restriction on the choice of $t$ for Meinshausen
and Rice's lower bound does not apply to our estimator. In fact we should
always choose $t$ to be greater than $\mu_n$, not smaller; see Table 1 for the most
informative $t$. This extra freedom in choosing $t$ yields consistency over a
larger range of $(\beta, r)$. In fact for the two-point Gaussian mixture model the
following theorem shows that our estimator is consistent for $\varepsilon_n$ over the
entire detectable region and in this sense the estimator is optimally adaptive.



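Theorem 6.1. Let $\Omega$ be any closed set contained in the interior of the
detectable region of the $\beta$-$r$ plane: $\{(\beta, r) : \rho^*(\beta) < r < 1, \frac12 < \beta < 1\}$. For any
sequence of $a_n$ such that $a_n/\sqrt{2\log\log n} \to 1$ and $P\{W_n^+ > a_n\}$ tends to 0,
then for all $\delta > 0$,

$$\lim_{n\to\infty}\ \sup_{\{(\beta,r)\in\Omega\}} P\Big\{\Big|\frac{\hat\varepsilon^*_{a_n}}{\varepsilon_n} - 1\Big| \ge \delta\Big\} = 0.$$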

Figure 2 plots on the $\beta$-$r$ plane the detection boundary which separates
the detectable and undetectable regions, and the classification boundary
which separates classifiable and unclassifiable regions. When $(\beta, r)$ belongs
to the classifiable region, it is also possible to reliably tell individually which
observations are signal and which are not. The dashed line is the separating line
of consistency of the Meinshausen and Rice lower bound: above it the lower bound is
consistent for $\varepsilon_n$, below it it is not; see Meinshausen and Rice [14]. The
right panel of Figure 2 shows seven subregions in the detectable region as
in Table 1 given in Section 6.2.

Fig. 2. Left panel: The detection boundary and the classification boundary together with
the separating line of consistency of Meinshausen and Rice (dashed line). Right panel:
seven subregions in the detectable region as in Table 1.



6.2. Most informative threshold. In this section we turn to an intuitive
understanding of the mean squared error property which is driven by the
value of $t$ that minimizes (6.3). More specifically, if we ignore the log-factor,
the mean squared error of the estimator given by the approximation in (6.3)
for a fixed $t$ satisfies

$$\big(1 - \tilde\varepsilon_{a_n}(t,F)/\varepsilon_n\big)^2 \sim n^{2\beta-1}\cdot\frac{F(t)(1-F(t))}{[\Phi(t) - \Phi(t-\mu_n)]^2}.$$

Minimizing this expression over $t$ yields the optimal rate of convergence as
given in Theorem 4.1. We call the minimizing value of $t$ the most informative
threshold and these values are tabulated in Table 1. Although the mean
squared error performance of the Meinshausen and Rice procedure has not
been computed it appears likely that a similar phenomenon holds. In this
case,

$$\big(1 - \tilde\varepsilon^{MR}_{a_n^*}(t,F)/\varepsilon_n\big)^2 \sim \Big[\frac{\Phi(t-\mu_n)}{\Phi(t)} + n^{\beta-1/2}\cdot\sqrt{(1-\Phi(t))/\Phi(t)}\Big]^2$$

and the value of $t$ which minimizes this expression is approximately
$\big(2 - [2 - \frac{2\beta-1}{r}]^{1/2}\big)\mu_n$.
Here we have assumed $r > 2\beta-1$ as otherwise the estimator is not consistent
and the most informative $t$ is not of interest; see Table 1. This shows that

$$\big(1 - \tilde\varepsilon^{MR}_{a_n^*}(t,F)/\varepsilon_n\big)^2 \sim n^{-2r\left(\sqrt{2-(2\beta-1)/r}\,-\,1\right)^2},$$

which should give the correct convergence rate for the mean squared error.
Here we have also omitted a log-factor. Since this convergence rate is always
slower than the optimal rate of convergence given in Theorem 4.1, it appears
at least according to this heuristic analysis that the optimal rate is never
achieved by Meinshausen and Rice's estimator. One possible reason for the
slow convergence rate is that in the analysis of the Meinshausen and Rice
procedure the most informative $t^*$ never exceeds $\mu_n$, whereas for our
procedure the most informative $t^*$ is never less than $\mu_n$. The most informative
thresholds are summarized in Table 1. Note that when $r < 2\beta-1$, the
Meinshausen and Rice lower bound is not consistent, so the most informative
threshold is not of interest (NOI). Detailed discussion on higher criticism
can be found in [5].

Table 1

Most informative threshold for Meinshausen and Rice's procedure and the newly proposed
procedure and higher criticism of Donoho and Jin [5]. The labels of region are
illustrated in the right panel in Figure 2
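The heuristic threshold for the Meinshausen and Rice bound can be checked by balancing the two terms in the display above on the polynomial scale. The sketch below is a hedged illustration rather than part of the paper: it writes $t = s\mu_n$ with $s < 1$, uses the tail approximations $\Phi(t-\mu_n) \approx n^{-(1-s)^2 r}$ and $1-\Phi(t) \approx n^{-s^2 r}$ (log factors dropped), and compares the resulting exponent with the optimal exponent implied by Theorem 4.2. All function names are hypothetical.

```python
import math

def mr_threshold_ratio(beta, r):
    """s such that t = s * mu_n balances the two terms of the MR error (heuristic)."""
    if r <= 2.0 * beta - 1.0:
        raise ValueError("Meinshausen-Rice bound is not consistent when r <= 2*beta - 1")
    return 2.0 - math.sqrt(2.0 - (2.0 * beta - 1.0) / r)

def mr_rate_exponent(beta, r):
    """Power of n in the heuristic squared relative error of the MR bound."""
    s = mr_threshold_ratio(beta, r)
    return -2.0 * r * (1.0 - s) ** 2

def optimal_rate_exponent(beta, r):
    """Power of n in the minimax squared relative error (log factors ignored)."""
    if beta >= 3.0 * r:
        return -1.0 - 2.0 * r + 2.0 * beta
    if beta > r:
        return -1.0 + (beta + r) ** 2 / (4.0 * r)
    return -1.0 + beta

beta, r = 4.0 / 7.0, 0.5
s = mr_threshold_ratio(beta, r)
first_term_power = -(1.0 - s) ** 2 * r            # from Phi(t - mu_n) / Phi(t)
second_term_power = beta - 0.5 - s ** 2 * r / 2   # from n^(beta - 1/2) * sqrt(1 - Phi(t))
```

At the simulation point $(\beta, r) = (4/7, 1/2)$ the two terms balance at $s \approx 0.69$, so the heuristic threshold lies below $\mu_n$, and the resulting exponent is closer to zero than the optimal one, which is the sense in which the rate is slower.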

6.3. Extensions and generalizations. We should stress that although the 
procedure presented in the present paper has better mean squared error 
performance than that of Meinshausen and Rice, the advantage of Mein- 
shausen and Rice's lower bound is that it does not assume any distribution 
of non-null cases. In this section, we address some possible extensions of the 
Gaussian model which may also shed further light on the approach taken in 
the present paper. 

Let $\{f(x;\mu) : \mu \ge 0\}$ be a family of density functions and let $X_1,\ldots,X_n$
be a random sample from a general one-sided mixture:

$$X_1,\ldots,X_n \stackrel{\text{i.i.d.}}{\sim} (1-\varepsilon_n) f(x;0) + \varepsilon_n \int f(x;\mu)\,dH(\mu), \qquad P(H > 0) = 1.$$

Two key components for the theory we developed in previous sections are:

(A) among all cumulative distribution functions passing through a given pair
of points $(\tau, a)$ and $(\tau', b)$, the most sparse one is a two-point mixture, and

(B) the proposed estimator is optimally adaptive in estimating $\varepsilon_n$ for the
family of two-point mixtures. We expect that our theory can be extended
to a broad class of families where (A) and (B) hold.

We have shown in an unpublished manuscript that two conditions that
suffice for (A) to hold are: (A1) the family of density functions is a strictly
monotone increasing family: $f(x;\mu)/f(x;0)$ is increasing in $x$ for all $\mu > 0$,
and (A2) $D(\mu;\tau,\tau')$ is strictly decreasing in $\mu > 0$ for any $\tau' > \tau > 0$, where
$D(\mu;\tau,\tau') = \frac{F(\tau;0) - F(\tau;\mu)}{F(\tau';0) - F(\tau';\mu)}$ and $F(\cdot;\mu)$ is the c.d.f. corresponding to $f(\cdot;\mu)$.

It is interesting to note that the two-sided Gaussian mixture satisfies the
above mentioned conditions. In fact, for $X$ from a two-sided Gaussian
mixture, $|X|$ can be viewed as a one-sided mixture from the family with c.d.f.
$F(x;\mu) = \Phi(x-\mu) + \Phi(x+\mu) - 1$, $x \ge 0$. It appears that (B) also holds in
this case although we leave a more detailed analysis for future study.

7. Simulations. We have carried out a small-scale empirical study of
the performance of our lower bound along with a comparison to
Meinshausen and Rice's lower bound for sample sizes similar to those studied
by Meinshausen and Rice. The purpose of the present section is only to
highlight a few points that occurred consistently in our simulations. One of
the points chosen in our study corresponded to $(\beta, r) = (4/7, 1/2)$. This
parameter is in a region where both Meinshausen and Rice's lower bound and
our lower bound are consistent. In our experiment, we simulated $n$ samples
from a c.d.f. $F(t) = (1-\varepsilon_n)\Phi(t) + \varepsilon_n\Phi(t-\mu_n)$, where $n = 10^7$, $\varepsilon_n = 10^{-4}$ and
$\mu_n = \sqrt{2 \times 0.5 \times \log n} \approx 4$. The reason we chose such a large $n$ is that the
signal is highly sparse. In fact, with the current $\beta$ and $n$, the number of
signals is about 1000.

The experiments started by calculating $\alpha_n$-percentiles by simulation for
$W_n^*$ needed for the Meinshausen and Rice procedure and for $Y_n$ for our
procedure. Denote the percentiles by $a_n^*$ and $a_n$, respectively, so that $P(W_n^* >
a_n^*) = \alpha_n$, and $P(Y_n > a_n) = \alpha_n$. Since $Y_n$ depends on the unknown
parameter $F(0)$, we replace $Y_n$ by $W_n^{++}$ as in Lemma 3.3. The simulated
data indicate that the difference between $W_n^+$ and $W_n^{++}$ is negligible and
$P(W_n^+ \ge a_n) \approx \alpha_n$, so a convenient way to calculate $a_n$ is through $W_n^+$
instead of $Y_n$. We then generated 5,000 simulated values of $W_n^*$ and $W_n^+$,
and calculated the values of $a_n^*$ and $a_n$ corresponding to eight chosen levels
$\alpha_n$ = 0.5%, 1%, 2.5%, 5%, 7.5%, 10%, 25% and 50%. The values are tabulated
in Table 2.

Next, we laid out grid points for calculating the lower bound $\hat\varepsilon^*_{a_n}$. Since
$2\log n = 32.24$, we chose 33 equally-spaced grid points: $t_j = (j-1)/\sqrt{2\log n}$,
$1 \le j \le 33$. We then ran 3,500 cycles of simulation.

• In each cycle we drew $n\cdot(1-\varepsilon_n)$ samples from $N(0,1)$ and $n\cdot\varepsilon_n$ samples
from $N(\mu_n, 1)$ to approximate $n$ samples from the two-point mixture
$(1-\varepsilon_n)N(0,1) + \varepsilon_n N(\mu_n,1)$.

• For each $a_n$, we used the above simulated data and the grid points to
calculate $\hat\varepsilon^*_{a_n}$.

• For each $a_n^*$, we used the simulated data to calculate $\hat\varepsilon^{MR}_{a_n^*}$.

The results are summarized in Table 2, as well as Figure 3.
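The calibration quoted above can be reproduced with a few lines of arithmetic. This is only a sanity-check sketch of the stated constants and the grid; it does not implement the estimators themselves, and the variable names are illustrative.

```python
import math

n = 10**7
beta, r = 4.0 / 7.0, 0.5

eps_n = n ** (-beta)                       # sparsity level n^(-beta)
mu_n = math.sqrt(2.0 * r * math.log(n))    # signal strength sqrt(2 r log n)
n_signals = n * eps_n                      # expected number of non-null samples
two_log_n = 2.0 * math.log(n)

# 33 equally spaced grid points t_j = (j - 1) / sqrt(2 log n), 1 <= j <= 33
grid = [(j - 1) / math.sqrt(two_log_n) for j in range(1, 34)]
```

With these values $\varepsilon_n$ is $10^{-4}$ up to rounding, $\mu_n$ is about 4.01, and the last grid point falls just below $\sqrt{2\log n} \approx 5.68$.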
Table 2

Comparison of our lower bound with Meinshausen and Rice's lower bound. The
comparison is based on 3,500 independent cycles of simulations. In each cycle, we
simulated $n = 10^7$ samples from a two-point mixture with $\varepsilon_n = 10^{-4}$ and
$\mu_n = \sqrt{2 \times 0.5 \times \log n} \approx 4$. The lower bounds were calculated for each of the eight chosen
$\alpha_n$-levels. The unsatisfactory performances of Meinshausen and Rice's lower bound are
displayed in boldface, and are caused by its heavy-tailed behavior

$\alpha_n$                                      0.005    0.01     0.025    0.05     0.075    0.10     0.25     0.50

Our lower bound, $\hat\varepsilon^*_{a_n}/\varepsilon_n$
$a_n/\sqrt{2\log\log n}$                        2.126    1.956    1.699    1.545    1.467    1.370    1.158    0.940
$P(\hat\varepsilon_n > \varepsilon_n)$                          0        0        0.0014   0.0026   0.0043   0.0077   0.026    0.114
Maximum                                         0.654    0.787    1.063    1.907    2.485    3.215    4.794    6.418
Mean                                            0.456    0.477    0.516    0.544    0.560    0.583    0.651    0.776
Median                                          0.450    0.471    0.508    0.531    0.546    0.562    0.608    0.677
Deviation                                       0.045    0.049    0.062    0.085    0.1015   0.127    0.211    0.373
$E[\hat\varepsilon_n/\varepsilon_n - 1]^2$                      0.299    0.276    0.238    0.215    0.204    0.190    0.167    0.189
$E(1 - \hat\varepsilon_n/\varepsilon_n)_+$                      0.545    0.523    0.485    0.458    0.442    0.421    0.364    0.285

Meinshausen and Rice's lower bound, $\hat\varepsilon^{MR}_{a_n^*}/\varepsilon_n$
$a_n^*/\sqrt{2\log\log n}$                      6.830    3.731    2.382    1.826    1.657    1.557    1.285    1.087
$P(\hat\varepsilon_n > \varepsilon_n)$                          0        0        0        0.002    0.007    0.013    0.101    0.290
Maximum                                         0.309    0.473    0.643    1.337    31.46    321.9    1113     1781
Mean                                            0.252    0.374    0.477    0.5457   0.6017   0.836    5.644    26.158
Median                                          0.251    0.373    0.472    0.537    0.562    0.579    0.639    0.739
Deviation                                       0.018    0.027    0.041    0.065    0.795    7.765    43.04    123.2
$E[\hat\varepsilon_n/\varepsilon_n - 1]^2$                      0.560    0.393    0.276    0.211    0.791    60.31    1873     15814
$E(1 - \hat\varepsilon_n/\varepsilon_n)_+$                      0.748    0.626    0.523    0.455    0.426    0.405    0.315    0.214

We draw attention to a number of features which showed up not only in
this simulation but in our other simulations as well. First, the distribution of
$\hat\varepsilon^*_{a_n}/\varepsilon_n$ has a relatively thin tail. Figure 3 gives histograms of $\hat\varepsilon^*_{a_n}/\varepsilon_n$ which
show that when it does overestimate, it only overestimates by a factor of
at most 5 or 6. Moreover, the chance of overestimation is in general much
smaller than $\alpha_n$, sometimes even 10 times smaller, which suggests the
theoretical upper bound for overestimation in Theorem 5.1 is quite conservative.
For example, column 7 of Table 2 suggests for $\alpha_n = 25\%$ the empirical
probability of overestimation $\approx 2.6\%$, which is roughly 10 times smaller. Finally,
when it does underestimate, the amount of underestimation is reasonably
small. In addition, the risks $E([\hat\varepsilon^*_{a_n}/\varepsilon_n] - 1)^2$ and $E(1 - [\hat\varepsilon^*_{a_n}/\varepsilon_n])_+$ are also
reasonably small. We also note that Meinshausen and Rice's lower bound
displays a heavy-tailed behavior; it can sometimes overestimate $\varepsilon_n$ by as
much as 1,100 times.

The performance of $\hat\varepsilon^*_{a_n}$ is not very sensitive to different choices of $a_n$
(or equivalently $\alpha_n$). As $\alpha_n$ gets larger, the mean and median of
$\hat\varepsilon^*_{a_n}$ slowly increase, and $E([\hat\varepsilon^*_{a_n}/\varepsilon_n] - 1)^2$ and $E(1 - [\hat\varepsilon^*_{a_n}/\varepsilon_n])_+$ decrease, which
suggests a better estimator for a larger $\alpha_n$ in a reasonable range, for example,
$\alpha_n \le 50\%$. The phenomenon can be interpreted by the thin tail property as
well as the fact that the chance of overestimation is slim: a larger $\alpha_n$ will not
increase much of the chance of overestimation, but it will certainly reduce
the underestimation and in effect make the whole estimator more accurate.

Fig. 3. Histograms for 3,500 simulated ratios between lower bounds and the true $\varepsilon_n$.
The simulations are based on $10^7$ samples from a two-point mixture with $\varepsilon_n = 10^{-4}$ and
$\mu_n = \sqrt{2 \times 0.5 \times \log n} \approx 4$. Top row: our lower bound. Bottom row: Meinshausen and
Rice's lower bound. From left to right, lower bounds correspond to different $\alpha_n$ levels:
0.005, 0.05 and 0.25. The last column is the log-histogram of the third column.
We now turn to Meinshausen and Rice's lower bound. $\hat\varepsilon^{MR}_{a_n^*}$ also provides
an honest lower bound, and $P(\hat\varepsilon^{MR}_{a_n^*} > \varepsilon_n)$ is typically much smaller than
$\alpha_n$. However, for relatively larger $\alpha_n$, empirical study shows that $\hat\varepsilon^{MR}_{a_n^*}$ is
not an entirely satisfactory lower bound as the variance of $\hat\varepsilon^{MR}_{a_n^*}$ is relatively
large. For example, when $\alpha_n \ge 0.1$, $E([\hat\varepsilon^{MR}_{a_n^*}/\varepsilon_n] - 1)^2$ can be as large as a few
hundred or a few thousand; see the cells in boldface in the table. Even for
smaller $\alpha_n$, $\hat\varepsilon^{MR}_{a_n^*}$ is slightly worse than $\hat\varepsilon^*_{a_n}$ if we compare the mean, median,
$E([\hat\varepsilon^{MR}_{a_n^*}/\varepsilon_n] - 1)^2$ and risks, and so on, which suggests $\hat\varepsilon^{MR}_{a_n^*}$ is not as accurate
as $\hat\varepsilon^*_{a_n}$.

The large variance of $\hat\varepsilon^{MR}_{a_n^*}$ is caused by its heavy-tailed behavior. We
have plotted the histograms of $\hat\varepsilon^{MR}_{a_n^*}/\varepsilon_n$. In some circumstances, $\hat\varepsilon^{MR}_{a_n^*}$ can
overestimate $\varepsilon_n$ by a factor of several hundred or even larger, and a
larger-scale study shows that this phenomenon will not disappear just by taking a
smaller $\alpha_n$.

Naturally, one wonders what causes such heavy-tailed behavior and how
to modify $\hat\varepsilon^{MR}_{a_n^*}$ such that it preserves the good property of $\hat\varepsilon^{MR}_{a_n^*}$ and has a
relatively thin tail. Recall that ([14])

$$\hat\varepsilon^{MR}_{a_n^*} = \sup_{0<t<1} \frac{F_n(t) - t - (a_n^*/\sqrt{n})\cdot\sqrt{t(1-t)}}{1-t}; \tag{7.1}$$

the heavy-tailed behavior of $\hat\varepsilon^{MR}_{a_n^*}$ is mainly caused by the denominator term
$(1-t)$, which can become extremely small as $t$ gets closer to 1. We
recommend dropping the term in the denominator and using the following as a
lower bound:

$$\check\varepsilon^{MR}_{a_n^*} = \sup_{0<t<1}\big[F_n(t) - t - (a_n^*/\sqrt{n})\cdot\sqrt{t(1-t)}\big].$$

Clearly this is still a lower bound which is a little bit more conservative
than $\hat\varepsilon^{MR}_{a_n^*}$. However, whenever the maximum in (7.1) is reached at $t \approx 0$, the
difference between $\check\varepsilon^{MR}_{a_n^*}$ and $\hat\varepsilon^{MR}_{a_n^*}$ is small. The advantage of this procedure is
that it has a thin tail.
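A minimal sketch of (7.1) and of the version without the $(1-t)$ denominator is given below, working directly on a list of p-values. It makes two simplifying assumptions that are not from the paper: the supremum is evaluated only at the observed p-values, and the constant $a_n^*$ is passed in by the caller rather than calibrated by simulation. The function name and the toy data are hypothetical.

```python
import math

def mr_lower_bounds(pvals, a_star):
    """Return (bound with the 1/(1-t) factor as in (7.1), bound without it)."""
    n = len(pvals)
    best_ratio = -math.inf
    best_plain = -math.inf
    for i, t in enumerate(sorted(pvals), start=1):
        if not 0.0 < t < 1.0:
            continue
        # F_n(t) = i / n at the i-th smallest p-value
        numerator = i / n - t - (a_star / math.sqrt(n)) * math.sqrt(t * (1.0 - t))
        best_plain = max(best_plain, numerator)
        best_ratio = max(best_ratio, numerator / (1.0 - t))
    return best_ratio, best_plain

# Toy data: 10 very small p-values (signal) and 90 evenly spread ones (null).
pvals = [1e-4 * i for i in range(1, 11)] + [(j - 0.5) / 90 for j in range(1, 91)]
with_denominator, without_denominator = mr_lower_bounds(pvals, a_star=1.0)
```

Whenever the numerator is positive somewhere, dividing by $1-t < 1$ can only increase it, so the version without the denominator is the more conservative of the two; the two nearly agree when the maximum is attained at small $t$.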

8. Proofs. 

8.1. Proof of Theorem 4.1. Before going into technical details, we briefly
explain the main ideas behind the proof. First note that there are two major
contributions to the risk: one part due to overestimating $\varepsilon_n$ and the other
part due to underestimating $\varepsilon_n$. By selecting $a_n$ as large as $4\sqrt{2\pi}\log^{3/2}(n)$,
the probability of overestimating is so small that the first part is negligible.
It is thus sufficient to limit our attention to the event where the estimator
underestimates $\varepsilon_n$. Now recall that the estimator $\hat\varepsilon^*_{a_n}$ is the maximum of a
collection of individual estimators $\hat\varepsilon^{(j)}_{a_n}$, each of which is based on a pair of
adjacent grid points $t_j$ and $t_{j+1}$. Comparing $\hat\varepsilon^{(j)}_{a_n}$ with $\hat\varepsilon^*_{a_n}$, it is clear that the
component of the risk due to $\hat\varepsilon^*_{a_n}$ underestimating $\varepsilon_n$ will not exceed that of
any $\hat\varepsilon^{(j)}_{a_n}$; hence we can choose any such estimator to give us an upper bound
for this component of the risk.

In detail, let $t_n^* = \sqrt{2q\log n}$ with

$$q = \begin{cases} 4r, & \beta \ge 3r,\\ (\beta+r)^2/(4r), & r < \beta < 3r,\\ r, & \beta \le r. \end{cases} \tag{8.1}$$
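The piecewise definition (8.1) translates directly into code. The helper names below are illustrative; the only inputs are the quantities $n$, $\beta$ and $r$ already used in the text.

```python
import math

def q_exponent(beta, r):
    """Exponent q in (8.1) defining t*_n = sqrt(2 q log n)."""
    if beta >= 3.0 * r:
        return 4.0 * r
    if beta > r:
        return (beta + r) ** 2 / (4.0 * r)
    return r

def most_informative_threshold(n, beta, r):
    """The threshold t*_n of (8.1)."""
    return math.sqrt(2.0 * q_exponent(beta, r) * math.log(n))

n, beta, r = 10**7, 4.0 / 7.0, 0.5
mu_n = math.sqrt(2.0 * r * math.log(n))
t_star = most_informative_threshold(n, beta, r)
```

Since $q \ge r$ in all three cases, $t_n^*$ is never smaller than $\mu_n$; for the simulation setting it is roughly 4.3 against $\mu_n \approx 4.0$.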

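The particular $j = j_0$ we would like to choose is the one which satisfies
$t_{j_0} \le t_n^* < t_{j_0+1}$. To elaborate the above observations, we denote the event
$\{F^-_{a_n}(t) \le F(t) \le F^+_{a_n}(t),\ \forall\, 0 < t \le \sqrt{2\log n}\}$ by $A_{a_n}$. First, note that for $a_n =
4\sqrt{2\pi}\log^{3/2}(n)$, Lemma 3.2 implies that $P((A_{a_n})^c) \le O(1/n^3)$. It then
follows that in the bound for the risk given by

$$E\Big(\frac{\hat\varepsilon^*_{a_n}}{\varepsilon_n} - 1\Big)^2 \le \Big(\frac{1}{\varepsilon_n}\Big)^2 P((A_{a_n})^c) + E\big([\hat\varepsilon^*_{a_n}/\varepsilon_n - 1]^2\cdot 1_{\{A_{a_n}\}}\big),$$

the first term is negligible. Second, note that
$\hat\varepsilon^{(j_0)}_{a_n} \le \hat\varepsilon^*_{a_n} \le \varepsilon_n$ over $A_{a_n}$, so

$$E\big([\hat\varepsilon^*_{a_n}/\varepsilon_n - 1]^2\cdot 1_{\{A_{a_n}\}}\big) \le E\big([\hat\varepsilon^{(j_0)}_{a_n}/\varepsilon_n - 1]^2\cdot 1_{\{A_{a_n}\}}\big). \tag{8.2}$$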
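Finally, the key inequality we need to show is

$$E\big([\hat\varepsilon^{(j_0)}_{a_n}/\varepsilon_n - 1]^2\cdot 1_{\{A_{a_n}\}}\big) \le
\begin{cases}
C\log(n)\,(a_n^2/n)\,\dfrac{F(t_n^*)(1-F(t_n^*))}{(\Phi(t_n^*) - F(t_n^*))^2}, & \beta > r,\\[2ex]
C\,(a_n^2/n)\,\dfrac{F(t_n^*)(1-F(t_n^*))}{(\Phi(t_n^*) - F(t_n^*))^2}, & \beta \le r.
\end{cases} \tag{8.3}$$

In fact, Theorem 4.1 follows directly by combining (8.2)-(8.3) with the
following lemma in which we calculate $[F(t_n^*)(1-F(t_n^*))]/[\Phi(t_n^*) - F(t_n^*)]^2$.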

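Lemma 8.1. Suppose $F(\cdot) = (1-\varepsilon_n)\Phi(\cdot) + \varepsilon_n\Phi(\cdot - \mu_n)$ with $\varepsilon_n = n^{-\beta}$,
$\mu_n = \sqrt{2r\log n}$, where $1/2 < \beta < 1$, and $r > \rho^*(\beta)$ so $(\beta, r)$ falls above the
detection boundary. With $t_n^*$ defined in (8.1),

$$\frac{F(t_n^*)(1-F(t_n^*))}{(\Phi(t_n^*) - F(t_n^*))^2} =
\begin{cases}
\sqrt{\pi r\log n}\cdot n^{-2r+2\beta}\cdot(1+o(1)), & \beta \ge 3r,\\[1ex]
\dfrac{\beta(\beta-r)}{r(\beta+r)}\sqrt{4\pi r\log n}\cdot n^{(\beta+r)^2/(4r)}\cdot(1+o(1)), & r < \beta < 3r,\\[1ex]
2n^{\beta}\cdot(1+o(1)), & \beta \le r.
\end{cases}$$

Moreover, for any $|t - t_n^*| \le c/\sqrt{\log n}$, there is a constant $C = C(r,\beta;c) > 0$
such that $F(t)(1-F(t))/(\Phi(t) - F(t))^2 \le C\cdot F(t_n^*)(1-F(t_n^*))/(\Phi(t_n^*) -
F(t_n^*))^2$.

Using $1 - \Phi(x) \sim \phi(x)/x$ for large $x$, the proof for Lemma 8.1 follows from
basic calculus and is thus omitted.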
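The simplest case of Lemma 8.1, $\beta \le r$ with $t_n^* = \mu_n$, can be checked numerically. The sketch below evaluates the ratio exactly for a two-point mixture and compares it with the stated leading term $2n^{\beta}$; the parameter values $n = 10^{12}$, $\beta = 0.6$, $r = 0.8$ are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def Q(x):
    """Standard normal upper tail 1 - Phi(x)."""
    return 0.5 * math.erfc(x / SQRT2)

def lemma81_ratio(n, beta, r, q):
    """F(t)(1 - F(t)) / (Phi(t) - F(t))^2 at t = sqrt(2 q log n) for the two-point mixture."""
    eps = n ** (-beta)
    mu = math.sqrt(2.0 * r * math.log(n))
    t = math.sqrt(2.0 * q * math.log(n))
    upper_F = (1.0 - eps) * Q(t) + eps * Q(t - mu)   # 1 - F(t)
    gap = eps * (Q(t - mu) - Q(t))                   # Phi(t) - F(t)
    return (1.0 - upper_F) * upper_F / gap ** 2

n, beta, r = 10**12, 0.6, 0.8
exact = lemma81_ratio(n, beta, r, q=r)   # beta <= r, so q = r in (8.1)
leading = 2.0 * n ** beta
```

For these values the exact ratio and the leading term agree to well within one percent. The other two cases rely on the Mills-ratio approximation at moderate arguments and converge much more slowly, so they are not checked here.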

The proof of (8.3) needs careful analysis on $|F^{\pm}_{a_n} - F|$ and $|\hat\mu^{(j_0)}_{a_n} - \mu_n|$. The
following lemmas are proved in [2], Sections 8.5.1 and 8.5.2, respectively.

Lemma 8.2. For fixed $0 < q < 1$, $a_n = O(\log^{3/2} n)$ and $t = t_n = \sqrt{2q\log n} +
O(1/\sqrt{\log n})$, we have that $|F^{\pm}_{a_n}(t) - F(t)| \le (a_n/\sqrt{n})\cdot\sqrt{F(t)(1-F(t))}\cdot
(1+o(1))$ over the event $A_{a_n}$.

Lemma 8.3. Suppose $F(\cdot) = (1-\varepsilon_n)\Phi(\cdot) + \varepsilon_n\Phi(\cdot - \mu_n)$ with $\varepsilon_n = n^{-\beta}$,
$\mu_n = \sqrt{2r\log n}$, where $1/2 < \beta < 1$, and $r > \rho^*(\beta)$ so $(\beta, r)$ falls above the
detection boundary. Then there is a constant $C > 0$ such that over the event
$A_{a_n}$, $\hat\mu^{(j_0)}_{a_n} \ge \mu_n$ and, for sufficiently large $n$,
$|\hat\mu^{(j_0)}_{a_n} - \mu_n| \le C\cdot(a_n/\sqrt{n})\cdot\sqrt{F(t_{j_0})(1-F(t_{j_0}))}/[\Phi(t_{j_0}) - F(t_{j_0})]$.
As a result, $E[(\hat\mu^{(j_0)}_{a_n} - \mu_n)\cdot 1_{\{A_{a_n}\}}]^2 \le
C\cdot(a_n^2/n)\cdot F(t_{j_0})(1-F(t_{j_0}))/[\Phi(t_{j_0}) - F(t_{j_0})]^2$.
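We now proceed to prove (8.3). For short, denote $A = A_{a_n}$, $\tau = t_{j_0}$, $\mu = \mu_n$,
$\hat\mu = \hat\mu^{(j_0)}_{a_n}$, $\varepsilon = \varepsilon_n$, $\hat\varepsilon = \hat\varepsilon^{(j_0)}_{a_n}$ and $F^+ = F^+_{a_n}$. By basic algebra, we can rewrite

$$\frac{\hat\varepsilon}{\varepsilon} - 1 = \frac{\Phi(\tau) - \Phi(\tau-\mu)}{\Phi(\tau) - \Phi(\tau-\hat\mu)}\cdot\Big[\frac{F(\tau) - F^+(\tau)}{\Phi(\tau) - F(\tau)} + \frac{\Phi(\tau-\hat\mu) - \Phi(\tau-\mu)}{\Phi(\tau) - \Phi(\tau-\mu)}\Big];$$

$\hat\mu \ge \mu$ over $A$ so the first term $\le 1$. We then have

$$(\hat\varepsilon/\varepsilon - 1)^2 \le 2\Big[\Big(\frac{F(\tau) - F^+(\tau)}{\Phi(\tau) - F(\tau)}\Big)^2 + \Big(\frac{\Phi(\tau-\hat\mu) - \Phi(\tau-\mu)}{\Phi(\tau) - \Phi(\tau-\mu)}\Big)^2\Big]. \tag{8.4}$$

Now, first, by Lemma 8.2,

$$\Big(\frac{F(\tau) - F^+(\tau)}{\Phi(\tau) - F(\tau)}\Big)^2 \le (a_n^2/n)\cdot\frac{F(\tau)(1-F(\tau))}{(\Phi(\tau) - F(\tau))^2}\cdot(1+o(1)), \tag{8.5}$$

and second, observe that

$$\frac{|\Phi(\tau-\hat\mu) - \Phi(\tau-\mu)|}{\Phi(\tau) - \Phi(\tau-\mu)} \le C\,\frac{\phi(\tau-\mu)}{\Phi(\tau) - \Phi(\tau-\mu)}\cdot|\hat\mu - \mu|,$$

where $\frac{\phi(\tau-\mu)}{\Phi(\tau) - \Phi(\tau-\mu)} = O(\tau-\mu)$ when $\beta > r$ and $= O(1)$ when $\beta \le r$. So by Lemma 8.3,

$$\Big(\frac{\Phi(\tau-\hat\mu) - \Phi(\tau-\mu)}{\Phi(\tau) - \Phi(\tau-\mu)}\Big)^2 \le
\begin{cases}
C\log(n)\,(a_n^2/n)\,\dfrac{F(\tau)(1-F(\tau))}{(\Phi(\tau) - F(\tau))^2}, & \beta > r,\\[2ex]
C\,(a_n^2/n)\,\dfrac{F(\tau)(1-F(\tau))}{(\Phi(\tau) - F(\tau))^2}, & \beta \le r;
\end{cases} \tag{8.6}$$

inserting (8.5)-(8.6) into (8.4) gives (8.3) and completes the proof of the
theorem.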



8.2. Proof of Theorem 4.2. The basic strategy underlying the proof of
Theorem 4.2 is to calculate the Hellinger affinity between pairs of carefully
chosen probability measures since, as Le Cam and Yang [12] have shown,
corresponding bounds for the minimax mean squared error easily follow.
More specifically, let $Q_{\theta_1}$ and $Q_{\theta_2}$ be a pair of probability measures. The
Hellinger affinity is defined by $A(Q_{\theta_1}, Q_{\theta_2}) = \int\sqrt{dQ_{\theta_1}\,dQ_{\theta_2}}$ and the minimax
risk is bounded as

(8.7) inf sup Eie-Ofy^-e^A^Qe^Qe,). 

e 0e{0i,02> 

The actual implementation of this general strategy in the proof of
Theorem 4.2 requires great care in the choice of the two probability measures and
involves somewhat delicate calculations of the affinity between these
measures. Let $X_1,\ldots,X_n \stackrel{\text{i.i.d.}}{\sim} P$. Let $P_0 = (1-\varepsilon_{0,n})N(0,1) + \varepsilon_{0,n}N(\mu_{0,n},1)$ and
$P_1 = (1-\varepsilon_{1,n})N(0,1) + \varepsilon_{1,n}N(\mu_{1,n},1)$. We shall write $\varepsilon_i$ for $\varepsilon_{i,n}$ and $\mu_i$ for
$\mu_{i,n}$ for $i = 0, 1$, and calibrate by $\varepsilon_0 = n^{-\beta}$, $\varepsilon_1 = \varepsilon_0 + (\log n)^{\rho} n^{-\tau}$ with $\tau > \beta$
and $\frac12 < \beta < 1$, $\mu_{0,n} = \sqrt{2r\log n}$ for some $r > 0$, and $\mu_{1,n} = \sqrt{2r\log n} - \delta_n$,
where $\delta_n$ is "small" and will be specified later.
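For intuition about the quantity being bounded, the Hellinger affinity between two two-point mixtures can be computed by direct numerical integration. The sketch below is illustrative only: it uses a plain trapezoidal rule on a fixed interval, and the parameter values are arbitrary rather than the calibration of the proof. All names are hypothetical.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mixture_density(x, eps, mu):
    """Density of (1 - eps) N(0,1) + eps N(mu,1)."""
    return (1.0 - eps) * phi(x) + eps * phi(x - mu)

def hellinger_affinity(p0, p1, lo=-12.0, hi=20.0, step=1e-3):
    """Trapezoidal approximation of the integral of sqrt(f0 * f1); p0, p1 are (eps, mu) pairs."""
    m = int(round((hi - lo) / step))
    total = 0.0
    for k in range(m + 1):
        x = lo + k * step
        value = math.sqrt(mixture_density(x, *p0) * mixture_density(x, *p1))
        total += value if 0 < k < m else 0.5 * value   # half weight at the two endpoints
    return total * step

P0 = (0.01, 3.0)
P1 = (0.02, 2.8)
same = hellinger_affinity(P0, P0)
cross = hellinger_affinity(P0, P1)
n_fold = cross ** 1000   # affinity between the joint laws of 1000 i.i.d. observations
```

The affinity of a measure with itself is 1, and the affinity between product measures is the $n$-th power of the single-observation affinity, which is why the proof needs $1 - A(P_0, P_1)$ to be of order $1/n$.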
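Denote by $P_{i,n}$ the joint distribution of $X_1,\ldots,X_n$ under $P_i$ for $i = 0, 1$.
Set $\lambda_n = \frac{\beta+r}{\sqrt{2r}}\sqrt{\log n}$ and

$$\Delta(x) = \varepsilon_0\big(e^{\mu_0 x - \mu_0^2/2} - 1\big) + \varepsilon_1\big(e^{\mu_1 x - \mu_1^2/2} - 1\big) + \varepsilon_0\varepsilon_1\big(e^{\mu_0 x - \mu_0^2/2} - 1\big)\big(e^{\mu_1 x - \mu_1^2/2} - 1\big).$$

Then simple calculations show that the
Hellinger affinity between $P_0$ and $P_1$ satisfies

$$A(P_0,P_1) = \int\sqrt{dP_0\,dP_1} = \int_{-\infty}^{\infty}\sqrt{1+\Delta(x)}\,\phi(x)\,dx = \Big\{\int_{-\infty}^{\lambda_n} + \int_{\lambda_n}^{\infty}\Big\}\sqrt{1+\Delta(x)}\,\phi(x)\,dx.$$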

It then follows from the inequalities vT+A >l + iA-|A 2 + 1 ^A 3 - T |gA 4 
and 1 + A(x) > [1 + (e ei) 1/2 (e wa; ^^/ 2 - l)V2( e A«i*-Mi/2 _ i)i/2]2 and some 
algebra that 

A(fb, Pi) > 1 - |Ai - ±A 2 + o{n~ l ), 

where 



+ 2 



$(A„ - no) J 
e J \$(A n -/i )/ 



x ( 1 _ e -(l/8)( M 0-Ml) 2 $(A w -(//Q + /Xl)/2) 

$ 1 / 2 (A„- M )^/ 2 (An-W) 



,2 



A 2 = e 2 e^5$(A n -2 M ) 

l/2\ 2 



X 



(l _ £i e (l/2)(M?-^) p(An-2^) \ 

V £o \$>(\ n - 2n ) J 

+ 2 fl e (l/2)(M?-A^) p(An-2^) \ V2 

£o V*(A n -2ju )/ 



X f 1 - e -(V2)(A*l-A«)) J 



^(An-(Mo + Mi)) 



$ 1 /2( An -2 M )^ 1 / 2 (A„-2/x 1 ) 



Case 1. /3 > 3r. In this case set r = i+r, P=\ and 5 n = (2r) x / 2 x 
n~ T+ ^. With these choices, direct calculations show that A 2 ^> Ai and 
it suffices to focus attention on A 2 in this case. We shall only consider 
the case (3 > 3r as the case /3 = 3r is similar. When (3 > 3r, > 

and A n > 2/ij, z = 0, 1 , for sufficiently large n. Hence A 2 = e^e^Kl — (1 + 
(logn)'n-+0)(l - » 8 n + \5l)f + 2[1 - (1 - i5 2 )]}(l + o(l)) = ^n-i(l + 
o(l)). Thus A(P ,Pi) > 1 - ±Ai - ±A 2 + o{n- 1 ) = 1 - ^n'^l + o(l)) 
and consequently A(P , n ,Pi, n ) = A n (P ,^i) > (1 - iSF^U + o(l))) n 
e -i/(i6r) ^ q ^gjj follows that the minimax lower bound for estima- 
tion under the mean squared error satisfies inf^ n sup( e?ii/J?i ) gf2?i E(e n — e n ) 2 > 
C(eo,n — £i,n) 2 = C(logn)n~ 1_2r for some constant C > 0. Hence 
m^sup {En:fln)eQn S(£ - l) 2 > C(logn)n^~^. 

Case 2. r < j3 < 3r. In this case set r = ± + - , p = f and 

^ _ (logn)^_ +^ _ ^n(logn) 3 / 4 ra -r+ ^. Note that for sufficiently large n, 
fii < X n < 2/j.i for i = 0,1. In this case Ai and A 2 are balanced. It then 
follows from the standard approximation to the Gaussian tail probability, 
$(x) = -J=- e -(W x2 (1 + o(l)) as x oo, that 



Ai = -e *(A n - Mo) 



(An - Mo) 



x { ([(lognyn-r+P - (X n - Mo)<5„] - t— =— ) 
IV A n - mo / 

X (1+0(1)) 

1 £ 2 2r 5 / 2 



r) 5 



and 



j^ 2 
(2/J, - A„) 

8r 5/2 



A 2 = e 2 e^(X n - 2/i ) , o „ \ (1 + o(l)) 



0F(/3-r)(3r-/?)3 



n^ 1 (l + o(l)). 



Hence A(fb,Pi) > 1 - - §A 2 + o(?i" 1 ) = 1 - cn -1 (l + o(l)), where 
c = 7^&F + TW^W - Therefore A ( J U*.») = An ( p ^Pi) > (1 " 



cn 



-l\n 



e c >0 and consequently inf £n sup (en ^ n)eCn £(§^-l) > C(e , 



£i,n) 2 > C(logn) 5 / 2 n- 1 - 2 ^+(' 3 + r ) 2 /(^). 

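Case 3. $\beta \le r$. In this case set $\tau = \frac12 + \frac{\beta}{2}$, $\rho = 0$ and $\delta_n = 0$. With these
choices $\mu_0 = \mu_1$ and this case is simpler than the other two cases. It is easy to
verify that $\Delta_1 \gg \Delta_2$ and $A(P_{0,n},P_{1,n}) = A^n(P_0,P_1) \ge (1 - cn^{-1})^n \to e^{-c} > 0$,
and once again it follows from (8.7) that $\inf_{\hat\varepsilon_n}\sup_{(\varepsilon_n,\mu_n)\in\Omega_n} E(\frac{\hat\varepsilon_n}{\varepsilon_n} - 1)^2 \ge Cn^{-1+\beta}$.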
8.3. Proof of Theorem 5.1. Consider the event $A_{a_n} = \{F^-_{a_n}(t) \le F(t) \le
F^+_{a_n}(t) : \forall\, 0 < t \le \sqrt{2\log n}\}$. For the first claim, on one hand, the above
argument shows that $\hat\varepsilon^*_{a_n} \le \varepsilon_n$ over $A_{a_n}$. On the other hand, it follows
directly from the definition of $F^{\pm}_{a_n}$ that $Y_n \le a_n$ over $A_{a_n}$, so by Lemma 3.2,
$P((A_{a_n})^c) \le P(Y_n > a_n) \le 2P(W_n^+ > a_n) \le \alpha$. Combining these, the first
claim follows from Lemma 5.1 and the argument right below it in Section 5.
The second claim follows similarly by using Lemma 3.3.



8.4. Proof of Theorem 5.2. We give only a sketch of the proof of Theorem 5.2 since the details in terms of calculating the Hellinger affinity are similar to the proof of Theorem 4.2. Without loss of generality assume $b_1 < 1 < b_2$. Set



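$$
\begin{cases}
\tau = \frac{1}{2} + r,\quad \rho = \frac{1}{2},\quad \delta_n \asymp n^{-\tau+\beta}, & \text{when } \beta \ge 3r,\\[4pt]
\tau = \frac{1}{2} + \beta - \frac{(\beta+r)^2}{8r},\quad \rho = \frac{5}{4},\quad \delta_n \asymp (\log n)^{3/4}n^{-\tau+\beta}, & \text{when } r < \beta < 3r,\\[4pt]
\tau = \frac{1}{2} + \frac{\beta}{2},\quad \rho = 0,\quad \delta_n = 0, & \text{when } \beta \le r.
\end{cases}
$$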



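For $\frac{1}{2} < \beta < 1$ and $0 < r < 1$, set $(\varepsilon_{0,n},\mu_{0,n}) = (n^{-\beta}, \sqrt{2r\log n})$ and $(\varepsilon_{1,n},\mu_{1,n}) = (\varepsilon_{0,n} + c_*(\log n)^{\rho}n^{-\tau}, \mu_{0,n} - \delta_n)$. It is clear that $(\varepsilon_{0,n},\mu_{0,n})$ and $(\varepsilon_{1,n},\mu_{1,n})$ are both in $\Omega_n$. Calculations as given in the proof of Theorem 4.2 then yield lower bounds on the Hellinger affinity which in turn give upper bounds on the $L_1$ distance between $P_{0,n}$ and $P_{1,n}$. These bounds show that for any given $0 < \gamma < \frac{1}{2}$ one can choose a constant $c_* > 0$ such that the $L_1$ distance between the distributions satisfies $L_1(P_{0,n},P_{1,n}) \le 2\gamma$. Since $\hat\varepsilon_n$ is a $(1-\alpha)$ level lower confidence limit over $\Omega_n$, $P_{0,n}(\hat\varepsilon_n \le \varepsilon_{0,n}) \ge 1 - \alpha$. Because $|P_{1,n}(B) - P_{0,n}(B)| \le \frac{1}{2}L_1(P_{0,n},P_{1,n}) \le \gamma$ for every event $B$, it then follows that $P_{1,n}(\hat\varepsilon_n \le \varepsilon_{0,n}) \ge 1 - \alpha - \gamma$ and hence $E_{1,n}(\varepsilon_{1,n} - \hat\varepsilon_n)_+ \ge (1-\alpha-\gamma)(\varepsilon_{1,n} - \varepsilon_{0,n}) = (1-\alpha-\gamma)c_*(\log n)^{\rho}n^{-\tau}$.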
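
To make the size of this bound concrete, the following sketch evaluates $(1-\alpha-\gamma)c_*(\log n)^{\rho}n^{-\tau}$. It assumes the case split $\tau = 1/2 + r$, $\rho = 1/2$ for $\beta \ge 3r$; $\tau = 1/2 + \beta - (\beta+r)^2/(8r)$, $\rho = 5/4$ for $r < \beta < 3r$; and $\tau = 1/2 + \beta/2$, $\rho = 0$ for $\beta \le r$. The names `tau_rho` and `lower_bound`, the default $c_* = 1$, and the numerical inputs are hypothetical choices made for illustration only.

```python
import math

def tau_rho(beta: float, r: float):
    """Exponents (tau, rho) of the perturbation c_* (log n)^rho n^(-tau).

    The three-way case split is an assumption stated in the lead-in.
    """
    if beta >= 3 * r:
        return 0.5 + r, 0.5
    if beta > r:
        return 0.5 + beta - (beta + r) ** 2 / (8 * r), 1.25
    return 0.5 + beta / 2, 0.0

def lower_bound(n: int, beta: float, r: float,
                alpha: float, gamma: float, c_star: float = 1.0) -> float:
    """Evaluate (1 - alpha - gamma) * c_* * (log n)^rho * n^(-tau)."""
    tau, rho = tau_rho(beta, r)
    return (1 - alpha - gamma) * c_star * math.log(n) ** rho * n ** (-tau)

# Illustrative inputs: one point in each of the three regimes.
for beta, r in [(0.9, 0.2), (0.7, 0.4), (0.6, 0.8)]:
    print(beta, r, tau_rho(beta, r), lower_bound(10**6, beta, r, 0.05, 0.1))
```

Dividing the squared perturbation by $\varepsilon_{0,n}^2 = n^{-2\beta}$ gives the exponent $2(\beta - \tau)$, which under these assumptions equals $-1-2r+2\beta$, $-1+(\beta+r)^2/(4r)$ and $-1+\beta$ in the three regimes.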



8.5. Proof of Theorem 5.3. We will only show the first claim since the proof of the second claim is similar. Let $A_n$ be the event that $\sqrt{n}\,|F_n(t) - F(t)|/\sqrt{F(t)(1-F(t))} \le 4\sqrt{2\pi}\log^{3/2}(n)$ for all $0 \le t \le \sqrt{2\log n}$; by Lemma 3.2 the risk over $A_n^c$ is negligible. Adapting the notation of the proof of Theorem 4.1, the key for the proof is that, similarly to the proof of Lemma 4.1, especially (8.3) and Lemma 8.1, the following is true for a wide range of $a_n$, for example, $O(\sqrt{\log\log n}) \le a_n \le 4\sqrt{2\pi}\log^{3/2} n$:

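$$
E\bigl[(1 - \hat\varepsilon^*_{a_n}/\varepsilon_n)_+^2\bigr] \le
\begin{cases}
Ca_n^2(\log n)^{2}\,n^{-1-2r+2\beta}, & \text{when } \beta \ge 3r,\\
Ca_n^2(\log n)^{2.5}\,n^{-1+(\beta+r)^2/(4r)}, & \text{when } r < \beta < 3r,\\
Ca_n^2(\log n)\,n^{-1+\beta}, & \text{when } \beta \le r.
\end{cases}
$$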

Using Hölder's inequality, and noting that $[1 - \hat\varepsilon^*_{a_n}/\varepsilon_n]_+ \le [1 - \hat\varepsilon^*_{a^*}/\varepsilon_n]_+$, all we need to show is that $a_n \le O(\sqrt{2\log\log n})$. Choose $a^*$ such that $P(W^* > a^*) = \alpha/2$ and compare it with $P(W_n^+ > a_n) = \alpha/2$: since $W_n^+ \le W^*$, it follows that $a_n \le a^*$. It is well known that $a^* \sim \sqrt{2\log\log n}$ for any fixed $0 < \alpha < 1$ (see, e.g., [15], page 600), so the claim follows directly.

8.6. Proof of Theorem 6.1. By Lemma 3.2, uniformly, the probability of over-estimation will not exceed $P\{Y_n > a_n\} \le 2P\{W_n^+ > a_n\}$, which tends to 0 by the choice of $a_n$. So it is sufficient to show that $(1 - \hat\varepsilon^*_{a_n}/\varepsilon_n)_+$ tends to 0 in probability uniformly for all $(\beta,r) \in \Omega$.

Note that Theorem 5.3 still holds if we replace the sequence $a_n$ there by the current one. Moreover, the inequality can be further strengthened: there is a constant $C(\Omega) > 0$ such that for sufficiently large $n$



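$$
E\bigl[(1 - \hat\varepsilon^*_{a_n}/\varepsilon_n)_+\bigr] \le
\begin{cases}
C(\Omega)\sqrt{\log\log n}\cdot(\log n)^{5/4}\cdot n^{-(1/2+r-\beta)}, & \text{when } \beta \ge 3r,\\
C(\Omega)\sqrt{\log\log n}\cdot(\log n)^{5/4}\cdot n^{-[1/2-(\beta+r)^2/(8r)]}, & \text{when } r < \beta < 3r,\\
C(\Omega)\sqrt{\log\log n}\cdot n^{-[1/2-\beta/2]}, & \text{when } \beta \le r.
\end{cases}
\tag{8.9}
$$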

At the same time, note that the exponents are bounded away from 0: 

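$$
d(\Omega) \equiv \min_{(\beta,r)\in\Omega}\Bigl\{\tfrac{1}{2} + r - \beta,\ \tfrac{1}{2} - \tfrac{(\beta+r)^2}{8r},\ \tfrac{1-\beta}{2}\Bigr\} > 0.
\tag{8.10}
$$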

Combining (8.9) and (8.10) yields that $E[(1 - \hat\varepsilon^*_{a_n}/\varepsilon_n)_+] \le C(\Omega)\cdot\sqrt{\log\log n}\times\log^{1.25}(n)\cdot n^{-d(\Omega)}$ for sufficiently large $n$, so it follows that uniformly $(1 - \hat\varepsilon^*_{a_n}/\varepsilon_n)_+$ tends to 0 in probability. This concludes the proof of Theorem 6.1.
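
The positivity required in (8.10) can be explored numerically. The sketch below evaluates the three exponents appearing in (8.9) and compares their sign with the detection boundary. The closed form of the boundary used here ($\beta - 1/2$ for $1/2 < \beta \le 3/4$ and $(1-\sqrt{1-\beta})^2$ for $3/4 < \beta < 1$) is taken from the detection literature of Ingster and of Donoho and Jin rather than from this section, and the function names and test points are illustrative assumptions.

```python
import math

def rate_exponent(beta: float, r: float) -> float:
    """Exponent of 1/n in the three cases of (8.9)."""
    if beta >= 3 * r:
        return 0.5 + r - beta
    if beta > r:
        return 0.5 - (beta + r) ** 2 / (8 * r)
    return (1 - beta) / 2

def detection_boundary(beta: float) -> float:
    """Detection boundary for 1/2 < beta < 1 (form assumed from the literature)."""
    if beta <= 0.75:
        return beta - 0.5
    return (1 - math.sqrt(1 - beta)) ** 2

# Illustrative points on either side of the boundary.
for beta, r in [(0.6, 0.05), (0.6, 0.15), (0.9, 0.4), (0.9, 0.5), (0.8, 0.9)]:
    print(beta, r, rate_exponent(beta, r), r > detection_boundary(beta))
```

At these points the exponent is positive exactly when $r$ lies above the boundary, in line with the claim that consistent estimation is possible throughout the detectable region.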